Channel: Planet PostgreSQL

Daniel Vérité: OIDs demoted to normal columns: a glance at the past


In PostgreSQL 12, oid columns in system tables will lose their “special” nature, and the optional clause WITH OIDS will disappear from CREATE TABLE. As a concrete consequence, oid will now be visible when running select * from the catalogs that have OIDs, as well as when querying information_schema.columns, or with \d inside psql. Until now they were hidden, as are all system columns such as xmin or xmax.
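
For instance, in a v12 psql session (a hedged sketch; output abridged, and values such as nspacl will vary):

SELECT * FROM pg_namespace WHERE nspname = 'pg_catalog';
--  oid |  nspname   | nspowner | nspacl
-- -----+------------+----------+--------
--   11 | pg_catalog |       10 | ...

In version 11 and earlier, the same query would not have shown the oid column unless it was selected explicitly (select oid, * from …).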

The commit message in the source repository mentions this reason for the change:

author Andres Freund <andres (at) anarazel (dot) de>
Wed, 21 Nov 2018 01:36:57 +0200 (15:36 -0800)
[…]
Remove WITH OIDS support, change oid catalog column visibility.
[…]
The fact that the oid column was not an ordinary column necessitated a significant amount of special case code to support oid columns. That already was painful for the existing, but upcoming work aiming to make table storage pluggable, would have required expanding and duplicating that “specialness” significantly.

Pluggable storage is a step towards the much expected zheap, as well as other formats in the future.

Looking back, this can be seen as the continuation of previous changes, which also went in the direction of obsoleting OIDs:

  • 7.2 (Feb 2002), the oid column becomes optional.
  • 8.0 (Jan 2005), the default_with_oids parameter is created.
  • 8.1 (Nov 2005), default_with_oids is now false by default.

But why was the OID as a special column invented in the first place? Originally, as the name “Object ID” suggests, the OID is related to object orientation.

A bit of history: object orientation in Postgres

In the mid-80s, the object orientation concept was surfacing, with OO languages like C++ in their early stages of design. In the database world, there was also the idea that maybe, in the future, people would want to look at their data primarily through OO lenses.

This can explain why in early versions of Postgres, when it was developed as a research project at the University of California, Berkeley, object orientation was a significant component.

In programming languages, the OO concept worked out quite well, for instance with the very successful C++ or Java. But in databases, the concept did not truly take off, and ended up limited to niche products.

When the community of developers took over Postgres in the mid-90’s with the goal of making it the powerful SQL engine that it became, they inherited a number of features clearly influenced by the object oriented paradigm, such as:

  • tables are classes.
  • rows in table are class instances.
  • a table may inherit its structure from a parent.
  • polymorphism in functions (through overloading) and operators.

Like most other database communities, the Postgres developers did not find much interest in pursuing the OO vision after the mid-90’s. Instead they focused on other goals, such as improving performance, robustness and conformance to the still-evolving SQL standard.

Anyway, the idea that a row is considered to be a class instance pretty much implies the existence of a unique identifier beyond the user-defined columns, to differentiate one instance from another. If we compare with a programming language, where classes are instantiated in memory, a class instance can be distinguished from a copy by having a different address in memory, even if all the rest is identical. In a way, the OID in Postgres is like the address of the class instance, serialized into a form meant to go to disk.

The old documentation still online explains the “row as a class instance” point of view:

Concepts in PostgreSQL 6.4 (1998):

The fundamental notion in Postgres is that of a class, which is a named collection of object instances. Each instance has the same collection of named attributes, and each attribute is of a specific type. Furthermore, each instance has a permanent object identifier (OID) that is unique throughout the installation. Because SQL syntax refers to tables, we will use the terms table and class interchangeably. Likewise, an SQL row is an instance and SQL columns are attributes.

An example was given in Populating a Class with Instances:

The insert statement is used to populate a class with instances:
INSERT INTO weather VALUES ('San Francisco', 46, 50, 0.25, '11/27/1994')
[…]
As previously discussed, classes are grouped into databases, and a collection of databases managed by a single postmaster process constitutes an installation or site.

After 7.1, released in 2001, the reference to classes has disappeared, and the documentation has replaced the OO-tainted expression “Creating a New Class” with the more mundane “Creating a New Table”.

In the catalogs, there are still remnants from this past, like the table of tables being named pg_class (but there is a view called pg_tables).

OIDs in modern PostgreSQL

OIDs as normal columns are still used in the catalogs as surrogate keys where such keys are needed. In PostgreSQL 12, 39 tables have a column named oid and 278 columns are of the oid type (versus 39 and 274 in version 11):

postgres=# SELECT
 count(*) filter (where attname = 'oid') as "OID as name",
 count(*) filter (where atttypid = 'oid'::regtype) as "OID as type"
FROM pg_attribute JOIN pg_class ON (attrelid=oid) WHERE relkind='r';

 OID as name | OID as type 
-------------+-------------
          39 |         278

Also, OIDs are essential in managing large objects, which store binary contents with transparent segmentation, since the API exposes these objects exclusively through their OIDs. The only user-visible change in v12 should be that pg_largeobject_metadata.oid is now directly visible, but a user doesn’t even need to query this table when using the API.
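
As a quick illustration (a hedged sketch; the OID returned will differ on your system), the server-side large object functions take and return OIDs, and the corresponding row in pg_largeobject_metadata is now plainly visible:

SELECT lo_from_bytea(0, 'hello, large object'::bytea) AS loid;  -- returns a new OID, e.g. 16404
SELECT oid, lomowner FROM pg_largeobject_metadata;              -- oid shows up as a normal column in v12
SELECT lo_unlink(16404);                                        -- clean up, using the OID returned above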

OIDs below 16384 are still reserved for the base system, in v12 as before.

The generator of values for OIDs is a counter at the cluster level, so that the values are sequentially distributed as if they came from a sequence shared by all databases.

That gives, for instance:

postgres=# create database db1;
CREATE DATABASE

postgres=# \lo_import .bashrc
lo_import 16404

postgres=# \c db1
You are now connected to database "db1" as user "daniel".

db1=# \lo_import .bashrc
lo_import 16405

This behavior with no duplicate OIDs across databases, despite the fact that large objects in different databases are independent from each other, looks like a remnant from the time when each OID was “unique throughout the installation”, as quoted above from the 6.4 documentation.

This global uniqueness constraint disappeared a long time ago, but the generator has kept this anti-collision behavior, such that collisions may only start to happen after a cycle of more than 4 billion values within the cluster (the OID counter being an unsigned 32-bit integer that wraps back to 16384 when it reaches 2^32).

If you’re curious about the internals, go read the comments around the GetNewOidWithIndex() function to see how the code enforces uniqueness after a wraparound, as well as the pg_nextoid SQL function that is intentionally not mentioned in the documentation.


Umair Shahid: Postgres is the coolest database – Reason #2: The License

Legal documents = SCARY!! That’s the typical equation, and it’s true – except when it comes to PostgreSQL. Let me explain… I have been told by both prospects and clients that when they sit down to negotiate terms with Oracle, they are faced with more lawyers than they have engineers. No wonder one shudders at […]

Laurenz Albe: Triggers to enforce constraints

Motörhead singing about deferred constraint triggers
© Laurenz Albe 2019

 

Sometimes you want to enforce a condition on a table that cannot be implemented by a constraint. In such a case it is tempting to use triggers instead. This article describes how to do this and what to watch out for.

It will also familiarize you with the little-known PostgreSQL feature of “constraint triggers”.

A test case

Suppose we have a table of prisons and a table of prison guards:

CREATE SCHEMA jail_app;

CREATE TABLE jail_app.prison (
   prison_id   integer PRIMARY KEY,
   prison_name text    NOT NULL
);

INSERT INTO jail_app.prison (prison_id, prison_name) VALUES
   (1, 'Karlau'),
   (2, 'Stein');

CREATE TABLE jail_app.guard (
   guard_id   integer PRIMARY KEY,
   guard_name text    NOT NULL
);

INSERT INTO jail_app.guard (guard_id, guard_name) VALUES
   (41, 'Alice'),
   (42, 'Bob'),
   (43, 'Chris');

Then we have a junction table that stores which guard is on duty in which prison:

CREATE TABLE jail_app.on_duty (
   prison_id integer REFERENCES jail_app.prison,
   guard_id  integer REFERENCES jail_app.guard,
   PRIMARY KEY (prison_id, guard_id)
);

INSERT INTO jail_app.on_duty (prison_id, guard_id) VALUES
   (1, 41), (2, 42), (2, 43);

So Alice is on duty in Karlau, and Bob and Chris are on duty in Stein.

Naïve implementation of a constraint as trigger

As guards go on and off duty, rows are added to and deleted from on_duty. We want to establish a constraint that at least one guard has to be on duty in any given prison.

Unfortunately there is no way to write this as a normal database constraint (if you are tempted to write a CHECK constraint that counts the rows in the table, think again).
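
To see why, here is a hedged sketch of the tempting approach. PostgreSQL refuses it outright, since CHECK constraints may not contain subqueries (and even if they could, a CHECK constraint is not evaluated on DELETE, which is exactly the case we care about):

ALTER TABLE jail_app.on_duty ADD CONSTRAINT at_least_one_guard
   CHECK ((SELECT count(*)
           FROM jail_app.on_duty d
           WHERE d.prison_id = prison_id) >= 1);
-- ERROR:  cannot use subquery in check constraint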

But it would be easy to write a BEFORE DELETE trigger that ensures the condition:

CREATE FUNCTION jail_app.checkout_trig() RETURNS trigger
   LANGUAGE plpgsql AS
$$BEGIN
   IF (SELECT count(*)
       FROM jail_app.on_duty
       WHERE prison_id = OLD.prison_id
      ) < 2
   THEN
      RAISE EXCEPTION 'sorry, you are the only guard on duty';
   END IF;

   RETURN OLD;
END;$$;

CREATE TRIGGER checkout_trig BEFORE DELETE ON jail_app.on_duty
   FOR EACH ROW EXECUTE PROCEDURE jail_app.checkout_trig();

But, as we will see in the next section, we made a crucial mistake here.

What is wrong with our trigger constraint?

Imagine Bob wants to go off duty.

The prison guard application runs a transaction like the following:

START TRANSACTION;

DELETE FROM jail_app.on_duty
WHERE guard_id = (SELECT guard_id
                  FROM jail_app.guard
                  WHERE guard_name = 'Bob');

COMMIT;

Now if Chris happens to have the same idea at the same time, the following could happen (the indented lines form a second, concurrent transaction):

START TRANSACTION;

DELETE FROM jail_app.on_duty
WHERE guard_id = (SELECT guard_id
                  FROM jail_app.guard
                  WHERE guard_name = 'Bob');

          START TRANSACTION;

          DELETE FROM jail_app.on_duty
          WHERE guard_id = (SELECT guard_id
                            FROM jail_app.guard
                            WHERE guard_name = 'Chris');

          COMMIT;

COMMIT;

Now the first transaction has not yet committed when the second DELETE runs, so the trigger function running in the second transaction cannot see the effects of the first delete. That means that the second transaction succeeds, both guards go off duty, and the prisoners can escape.

You may think that this is a rare occurrence and you can get by ignoring that race condition in your application. But don’t forget there are bad people out there, and they may attack your application using exactly such a race condition (in the recent fad of picking impressive names for security flaws, this has been called an ACIDRain attack).

Do normal constraints have the same problem?

Given the above, you may wonder if regular constraints are subject to the same problem. After all, this is a consequence of PostgreSQL’s multi-version concurrency control (MVCC).

When checking constraints, PostgreSQL also checks rows that would normally not be visible to the current transaction. This is against the normal MVCC rules, but guarantees that constraints are not vulnerable to this race condition.

You could potentially do the same if you write a trigger function in C, but few people are ready to do that. With trigger functions written in any other language, you have no way to “peek” at uncommitted data.

Solving the problem with “pessimistic locking”

We can avoid the race condition by explicitly locking the rows we check. This effectively serializes data modifications, so it reduces concurrency and hence performance.

Don’t consider locking the whole table, even if it seems a simpler solution.

Our trigger now becomes a little more complicated. We want to avoid deadlocks, so we will make sure that we always lock rows in the same order. For this we need a statement level trigger with a transition table (new since v10):

CREATE OR REPLACE FUNCTION jail_app.checkout_trig() RETURNS trigger
   LANGUAGE plpgsql AS
$$BEGIN
   IF EXISTS (
         WITH remaining AS (
            /* of the prisons where somebody went off duty,
               select those which have a guard left */
            SELECT on_duty.prison_id
            FROM jail_app.on_duty
               JOIN deleted
                  ON on_duty.prison_id = deleted.prison_id
            ORDER BY on_duty.prison_id, on_duty.guard_id
            /* lock those remaining entries */
            FOR UPDATE OF on_duty
         )
         SELECT prison_id FROM deleted
         EXCEPT
         SELECT prison_id FROM remaining
      )
   THEN
      RAISE EXCEPTION 'cannot leave a prison without guards';
   END IF;

   RETURN NULL;
END;$$;

DROP TRIGGER IF EXISTS checkout_trig ON jail_app.on_duty;

CREATE TRIGGER checkout_trig AFTER DELETE ON jail_app.on_duty
   REFERENCING OLD TABLE AS deleted
   FOR EACH STATEMENT EXECUTE PROCEDURE jail_app.checkout_trig();

This technique is called “pessimistic locking” since it expects that there will be concurrent transactions that “disturb” our processing. Such concurrent transactions are preemptively blocked. Pessimistic locking is a good strategy if conflicts are likely.
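
To see the effect (a hedged walk-through of the scenario from above): with the statement-level trigger in place, the second transaction’s DELETE now blocks on the row lock taken by the first transaction’s trigger, and once the first transaction commits, the second one fails instead of leaving the prison unguarded.

-- second, concurrent transaction (Chris), while Bob's transaction is still open:
START TRANSACTION;

DELETE FROM jail_app.on_duty
WHERE guard_id = (SELECT guard_id
                  FROM jail_app.guard
                  WHERE guard_name = 'Chris');
-- ... blocks until the first transaction commits, then:
-- ERROR:  cannot leave a prison without guards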

Solving the problem with “optimistic locking”

Different from pessimistic locking, “optimistic locking” does not actually lock the contended objects. Rather, it checks that no concurrent transaction has modified the data between the time we read them and the time we modify the database.

This improves concurrency, and we don’t have to change our original trigger definition. The down side is that we must be ready to repeat a transaction that failed because of concurrent data modifications.

The most convenient way to implement optimistic locking is to raise the transaction isolation level. In our case, REPEATABLE READ is not enough to prevent inconsistencies, and we’ll have to use SERIALIZABLE.

All transactions that access jail_app.on_duty must start like this:

START TRANSACTION ISOLATION LEVEL SERIALIZABLE;

Then PostgreSQL will make sure that concurrent transactions won’t succeed unless they are serializable. That means that the transactions can be ordered so that serial execution of the transactions in this order would produce the same result.

If PostgreSQL cannot guarantee this, it will terminate one of the transactions with

ERROR:  could not serialize access due to read/write dependencies
        among transactions
HINT:  The transaction might succeed if retried.

This is a serialization error (SQLSTATE 40001) and doesn’t mean that you did something wrong. Such errors are normal with isolation levels above READ COMMITTED and tell you to simply retry the transaction.

Optimistic locking is a good strategy if conflicts are expected to occur only rarely. Then you don’t have to pay the price of repeating the transaction too often.
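
In practice this means wrapping each transaction in a retry loop on the client side. A minimal sketch of the pattern (the retry logic itself lives in the application):

START TRANSACTION ISOLATION LEVEL SERIALIZABLE;

DELETE FROM jail_app.on_duty
WHERE guard_id = (SELECT guard_id
                  FROM jail_app.guard
                  WHERE guard_name = 'Bob');

COMMIT;

-- if any statement (including COMMIT) fails with SQLSTATE 40001:
--   ROLLBACK;   -- then re-run the whole transaction from the start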

It should be noted that SERIALIZABLE comes with a certain performance hit. This is because PostgreSQL has to maintain additional “predicate locks”. See the documentation for performance considerations.

What about these “constraint triggers”?

Finally, PostgreSQL has the option to create “constraint triggers” with CREATE CONSTRAINT TRIGGER. It sounds like such triggers could be used to avoid the race condition.

Constraint triggers respect the MVCC rules, so they cannot “peek” at uncommitted rows of concurrent transactions. But the trigger execution can be deferred to the end of the transaction. They also have an entry in the pg_constraint system catalog.

Note that constraint triggers have to be AFTER triggers FOR EACH ROW, so we will have to rewrite the trigger function a little:

CREATE OR REPLACE FUNCTION jail_app.checkout_trig() RETURNS trigger
   LANGUAGE plpgsql AS
$$BEGIN
   -- the deleted row is already gone in an AFTER trigger
   IF (SELECT count(*) FROM jail_app.on_duty
       WHERE prison_id = OLD.prison_id
      ) < 1
   THEN
      RAISE EXCEPTION 'sorry, you are the only guard on duty';
   END IF;

   RETURN OLD;
END;$$;

DROP TRIGGER IF EXISTS checkout_trig ON jail_app.on_duty;

CREATE CONSTRAINT TRIGGER checkout_trig
   AFTER DELETE ON jail_app.on_duty
   DEFERRABLE INITIALLY DEFERRED
   FOR EACH ROW EXECUTE PROCEDURE jail_app.checkout_trig();

By making the trigger INITIALLY DEFERRED, we tell PostgreSQL to check the condition at COMMIT time. This will reduce the window for the race condition a little, but the problem is still there. If concurrent transactions run the trigger function at the same time, they won’t see each other’s modifications.

If constraint triggers don’t live up to the promise in their name, why do they have that name? The answer is in the history of PostgreSQL: CREATE CONSTRAINT TRIGGER was originally used “under the hood” to create database constraints. Even though that is no longer the case, the name has stuck. “Deferrable trigger” would be a better description.

Conclusion

If you don’t want to be vulnerable to race conditions with a trigger that enforces a constraint, use locking or higher isolation levels.

Constraint triggers are not a solution.

The post Triggers to enforce constraints appeared first on Cybertec.

Sebastian Insausti: An Overview of Streaming Replication for TimescaleDB


Nowadays, replication is a given in any high availability and fault-tolerant environment, for pretty much any database technology you’re using. It is a topic that we have seen over and over again, but one that never gets old.

If you’re using TimescaleDB, the most common type of replication is streaming replication, but how does it work?

In this blog, we are going to review some concepts related to replication and we’ll focus on streaming replication for TimescaleDB, which is a functionality inherited from the underlying PostgreSQL engine. Then, we’ll see how ClusterControl can help us to configure it.

Streaming replication is based on shipping the WAL records and having them applied to the standby server. So first, let’s see what WAL is.

WAL

Write Ahead Log (WAL) is a standard method for ensuring data integrity, and it is automatically enabled by default.

The WALs are the REDO logs in TimescaleDB. But, what are the REDO logs?

REDO logs contain all changes that were made in the database and they are used by replication, recovery, online backup and point in time recovery (PITR). Any changes that have not been applied to the data pages can be redone from the REDO logs.

Using WAL results in a significantly reduced number of disk writes, because only the log file needs to be flushed to disk to guarantee that a transaction is committed, rather than every data file changed by the transaction.

A WAL record will specify, bit by bit, the changes made to the data. Each WAL record will be appended into a WAL file. The insert position is a Log Sequence Number (LSN) that is a byte offset into the logs, increasing with each new record.
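
A few built-in functions let you inspect the current WAL position directly (a hedged sketch; the function names shown are the PostgreSQL 10+ spellings, older releases used pg_current_xlog_location and friends):

SELECT pg_current_wal_lsn();                          -- current WAL write position, e.g. 0/3000060
SELECT pg_walfile_name(pg_current_wal_lsn());         -- name of the WAL segment file containing that LSN
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0');  -- amount of WAL generated so far, in bytes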

The WALs are stored in the pg_wal directory, under the data directory. These files have a default size of 16MB (the size can be changed by altering the --with-wal-segsize configure option when building the server). Each file has a unique, incrementing name made of three groups of 8 hexadecimal digits (timeline, log and segment), for example "000000010000000000000001".

The number of WAL files contained in pg_wal will depend on the value assigned to the min_wal_size and max_wal_size parameters in the postgresql.conf configuration file.

One parameter that we need to set when configuring all our TimescaleDB installations is wal_level. It determines how much information is written to the WAL. The lowest value, minimal, writes only the information needed to recover from a crash or immediate shutdown; replica (which in current PostgreSQL versions subsumes the older archive and hot_standby levels) adds the logging required for WAL archiving and for running read-only queries on a standby server, and is the default in PostgreSQL 10 and later; finally, logical adds the information necessary to support logical decoding. This parameter requires a restart, so it can be hard to change on a running production database if we have forgotten about it.
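
A hedged sketch of checking and adjusting these settings from SQL (ALTER SYSTEM writes to postgresql.auto.conf; the values are illustrative and wal_level still requires a restart to take effect):

SHOW wal_level;
SHOW max_wal_size;

ALTER SYSTEM SET wal_level = replica;
ALTER SYSTEM SET max_wal_size = '2GB';   -- illustrative value
SELECT pg_reload_conf();                 -- enough for min/max_wal_size; wal_level needs a restart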

Streaming Replication

Streaming replication is based on the log shipping method. The WAL records are directly moved from one database server into another to be applied. We can say that it is a continuous PITR.

This transfer can be performed in two different ways: by transferring WAL records one file (WAL segment) at a time (file-based log shipping), or by transferring WAL records (a WAL file is composed of WAL records) on the fly (record-based log shipping), between a master server and one or several slave servers, without waiting for the WAL file to be filled.

In practice, a process called WAL receiver, running on the slave server, connects to the master server over a TCP/IP connection. On the master server, another process, named WAL sender, is in charge of sending the WAL records to the slave server as they happen.

Streaming replication can be represented as follows:

Looking at the above diagram, we might wonder: what happens when the communication between the WAL sender and the WAL receiver fails?

When configuring streaming replication, we have the option to enable WAL archiving.

This step is actually not mandatory, but it is extremely important for a robust replication setup, as it is necessary to prevent the main server from recycling old WAL files that have not yet been applied to the slave. If this occurs, we will need to recreate the replica from scratch.

When configuring replication with continuous archiving, we start from a backup and, to reach a state that is in sync with the master, we need to apply all the changes recorded in the WAL that happened after the backup. During this process, the standby will first restore all the WAL available in the archive location (by calling restore_command). When it reaches the last archived WAL record, restore_command will fail, and after that the standby will look in the pg_wal directory to see if the change exists there (this is actually done to avoid data loss when the master server crashes and some changes that have already been streamed to the replica and applied there have not yet been archived).

If that fails, and the requested record does not exist there, then it will start communicating with the master through streaming replication.

Whenever streaming replication fails, it will go back to step 1 and restore the records from archive again. This loop of retrieving from the archive, pg_wal, and via streaming replication goes on until the server is stopped or failover is triggered by a trigger file.
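
A hedged sketch of the parameters involved (paths and values are illustrative only; on PostgreSQL 11 and older the standby settings live in recovery.conf):

# primary: postgresql.conf
archive_mode = on
archive_command = 'cp %p /mnt/wal_archive/%f'   # real setups should refuse to overwrite existing files

# standby: recovery.conf
standby_mode = 'on'
restore_command = 'cp /mnt/wal_archive/%f %p'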

The following diagram shows such a configuration:

Streaming replication is asynchronous by default, so at some given moment we can have some transactions that can be committed in the master and not yet replicated into the standby server. This implies some potential data loss.

However, this delay between the commit and impact of the changes in the replica is supposed to be really small (some milliseconds), assuming of course that the replica server is powerful enough to keep up with the load.

For the cases when even the risk of a small data loss is not tolerable, we can use the synchronous replication feature.

In synchronous replication, each commit of a write transaction will wait until confirmation is received that the commit has been written to the write-ahead log on disk of both the primary and standby server.

This method minimizes the possibility of data loss, as that would require both the master and the standby to fail at the same time.

The obvious downside of this configuration is that the response time for each write transaction increases, as we need to wait until all parties have responded. So the time for a commit is, at minimum, the round trip between the master and the replica. Read-only transactions will not be affected by that.

To set up synchronous replication, each of the standby servers must specify an application_name in the primary_conninfo of its recovery.conf file: primary_conninfo = '...application_name=slaveX'.

We also need to specify, on the master, the list of standby servers that are going to take part in the synchronous replication: synchronous_standby_names = 'slaveX,slaveY'.

We can set up one or several synchronous servers, and this parameter also specifies which method (FIRST or ANY) is used to choose synchronous standbys from the listed ones.
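
Putting that together, a hedged configuration sketch (host names and standby names are illustrative):

# each standby: recovery.conf
primary_conninfo = 'host=master.example.com user=replicator application_name=slaveX'

# primary: postgresql.conf
synchronous_standby_names = 'FIRST 1 (slaveX, slaveY)'   # or: ANY 1 (slaveX, slaveY)
synchronous_commit = on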

To deploy TimescaleDB with streaming replication setups (synchronous or asynchronous), we can use ClusterControl, as we can see here.

After we have configured our replication, and it is up and running, we will need to have some additional features for monitoring and backup management. ClusterControl allows us to monitor and manage backups/retention of our TimescaleDB cluster from the same place without any external tool.


How to Configure Streaming Replication on TimescaleDB

Setting up streaming replication is a task that requires some steps to be followed thoroughly. If you want to configure it manually, you can follow our blog about this topic.

However, you can deploy or import your current TimescaleDB on ClusterControl, and then you can configure streaming replication with a few clicks. Let’s see how we can do it.

For this task, we’ll assume you have your TimescaleDB cluster managed by ClusterControl. Go to ClusterControl -> Select Cluster -> Cluster Actions -> Add Replication Slave.

We can create a new replication slave (standby) or we can import an existing one. In this case, we’ll create a new one.

Now, we must select the Master node, add the IP Address or hostname for the new standby server, and the database port. We can also specify if we want ClusterControl to install the software and if we want to configure synchronous or asynchronous streaming replication.

That’s all. We only need to wait until ClusterControl finishes the job. We can monitor the status from the Activity section.

After the job has finished, we should have the streaming replication configured and we can check the new topology in the ClusterControl Topology View section.

By using ClusterControl, you can also perform several management tasks on your TimescaleDB like backup, monitor and alert, automatic failover, add nodes, add load balancers, and even more.

Failover

As we could see, TimescaleDB uses a stream of write-ahead log (WAL) records to keep the standby databases synchronized. If the main server fails, the standby contains almost all of the data of the main server and can be quickly made the new master database server. This can be synchronous or asynchronous and can only be done for the entire database server.

To effectively ensure high availability, it is not enough to have a master-standby architecture. We also need to enable some automatic form of failover, so if something fails we can have the smallest possible delay in resuming normal functionality.

TimescaleDB does not include an automatic failover mechanism to identify failures on the master database and tell a slave to take over, so that requires a little bit of work on the DBA’s side. After a failover you will also have only one server working, so the master-standby architecture needs to be re-created to get back to the normal situation we had before the issue.

ClusterControl includes an automatic failover feature for TimescaleDB to improve mean time to repair (MTTR) in your high availability environment. In case of failure, ClusterControl will promote the most advanced slave to master, and it’ll reconfigure the remaining slave(s) to connect to the new master. HAProxy can also be automatically deployed in order to offer a single database endpoint to applications, so they are not impacted by a change of the master server.

Limitations

We have some well-known limitations when using Streaming Replication:

  • We cannot replicate into a different version or architecture
  • We cannot change anything on the standby server
  • We do not have much granularity on what we can replicate

So, to overcome these limitations, we have the logical replication feature. To know more about this replication type, you can check the following blog.

Conclusion

A master-standby topology has many different uses, such as analytics, backups, high availability and failover. In any case, it’s necessary to understand how streaming replication works on TimescaleDB. It’s also useful to have a system to manage the whole cluster and give you the possibility to create this topology in an easy way. In this blog, we saw how to achieve this by using ClusterControl, and we reviewed some basic concepts about streaming replication.

Álvaro Hernández: Having lunch with PostgreSQL, MongoDB and JSON


In a post titled “Postgres JSON, Developer Productivity, and The MongoDB Advantage”, Buzz Moschetti discussed PostgreSQL’s handling of JSON and how (inconvenient) it is for developers, especially when compared to MongoDB. While the post is almost 18 months old, the principles described there have not changed, and I (mostly) respectfully disagree. Here is my opinion on the topic.

Let’s see what there is on today’s menu.

Cartoon: PostgreSQL and MongoDB entering into the Database Steakhouse, to eat some JSON

Small bites

SQL syntax and, indeed, the relational model as a whole are designed to work with single, scalar values which carry the same type from row to row, not rich shapes like JSON that can contain substructures and arrays and different elements from row to row.

If anything, SQL is about set operations on tuples, not scalar values. But, I get Buzz’s point, he probably meant “columns”. Yet still not correct. The SQL standard has had support for arrays as a column type since 1999! Including functions to access, construct or create arrays. PostgreSQL is actually more advanced, supporting multidimensional arrays, and even a set of key-value pairs with the hstore datatype (again: all that within a single column). On top of that, PostgreSQL also supports custom data types (which can also be row types or data structures) and combinations of all that. So not simple scalar values. And it obviously supports JSON (with the jsonb data type), which will be further discussed here.
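
As a quick, hedged sketch of those non-scalar column types (assuming the hstore contrib extension is available):

create extension if not exists hstore;

create table demo (
    tags  text[],     -- SQL-standard array
    attrs hstore,     -- key/value pairs
    doc   jsonb       -- JSON document
);

insert into demo values (array['a','b'], 'k=>v', '{"x": 1}');

select tags[1] as first_tag, attrs->'k' as k, doc->>'x' as x from demo;
--  first_tag | k | x
-- -----------+---+---
--  a         | v | 1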

And the extensions to SQL utilized by Postgres to manipulate JSON are proprietary, unknown to most SQL developers, and not supported or even recognized by most 3rd party SQL tools.

I’m not aware of MongoDB’s language being part of any standard, so we should assume Buzz’s comment about proprietary language applies to both MongoDB and PostgreSQL equally. Being that true, there are some important catches:

  • PostgreSQL’s proprietary syntax is only for accessing JSON data. The rest of the language, and data accessed via non-JSON data types, is pure, standard SQL. None of MongoDB’s query language is standard.

  • It is ironic to mention the support by third party tools, when MongoDB’s query language still struggles to get third party support on areas like data warehousing and Business Intelligence. The world is a SQL-dominated ecosystem, which plays along with PostgreSQL very well. It is so ironic that even MongoDB’s first implementation of its (proprietary, commercial) BI connector was based on PostgreSQL.

  • PostgreSQL is currently working on also supporting the recently adopted JSON support within the SQL standard.

You cannot simply set aside the fact JSON does not support dates as a native type; you must do something, somewhere in your stack to accomodate for this so that a real date e.g. java.util.Date is used by your application. Letting the application itself handle the problem is a completely non-scalable architecture and dooms the system to bugs and reconciliation errors as individual applications make decisions about how to deal with the data.

I also prefer a richer data serialization format than JSON. Yet most people deal with JSON directly, even when using MongoDB, rather than with MongoDB’s BSON. In any case, a data type conversion like the one used as an example in Buzz’s post can be done very easily at query time:

select (content->>'cd')::timestamp from foo;
┌─────────────────────────┐
│        timestamp        │
├─────────────────────────┤
│ 2017-11-22 15:20:34.326 │
└─────────────────────────┘

Note: the timestamp contained in the JSON string above contains a timezone indication. It would have been better to cast it to a timestamptz. But since PostgreSQL would have converted that to a timezone based on your server’s local timezone, just for representational purposes, it might have caused some confusion for readers not versed in PostgreSQL’s advanced date and time management capabilities.

Moreover, JSON is typed. And actually, PostgreSQL provides support via the jsonb_typeof function to return the resolved datatypes:

select jsonb_typeof(content->'props') typeof_props, jsonb_typeof(content->'props'->'a') typeof_a from foo;
┌──────────────┬──────────┐
│ typeof_props │ typeof_a │
├──────────────┼──────────┤
│ object       │ number   │
└──────────────┴──────────┘

I believe Buzz’s statement quoted above is overly exaggerated. It is not doomsday to do some application-level data type enrichment. It is, at the very least, the same problem the application already needs to deal with when working with loose schemas: managing absent keys, different versions of documents or different data types for the same document key. Even if they would come as strongly typed properties in BSON! So neither PostgreSQL nor MongoDB avoid this problem when working with unstructured data.

Repository with all the source code relevant to this blog post, data used and README.md

Main course

Nearly all traditional Postgres interaction with applications – written in almost every language – works via O/JDBC drivers and these drivers do not have the capability to properly and precisely convert the JSON into a useful native type (such as a Map or List in Java or a Dictionary or List in Python).

Buzz goes on to say that “we have to manually parse the JSON in our application”. And I agree. But I don’t see it being a problem. Let’s show how simple it is to do it:

  • Add Gson parser to your pom.xml (or any other JSON parser, for that matter):
<dependency>
  <groupId>com.google.code.gson</groupId>
  <artifactId>gson</artifactId>
  <version>2.8.5</version>
</dependency>
  • Create a native Java class to hold the parsed values:
class JsonContent {
    private int a;
    private int[] fn;

    @Override
    public String toString() {
        return "JsonContent{ a=" + a + ", fn=" + Arrays.toString(fn) + " }";
    }
}
  • Parse the results:
String json = rs.getString(1);
Gson gson = new GsonBuilder().create();
JsonContent jsonContent = gson.fromJson(json, JsonContent.class);
System.out.println(jsonContent);

The output is the expected one:

JsonContent{ a=12, fn=[10, 20, 30] }

It doesn’t look to me like the end of the world. Buzz considers also as a problem the availability of different JSON parsers and their interoperability. About the former I see it more as an advantage; and about the latter, it’s a non-issue: after all, JSON is a spec!

Source code for the example above

Side order: polymorphism

Libraries like Gson also have very good support to parse arbitrary, polymorphic JSONs. But even so, it’s a rare case that your data shape is changing so dramatically from document to document. Because even when parsed correctly, your application still needs to deal with that polymorphism! Otherwise, just treating it as a simple string would be so much easier.

So how does a parser like Gson deal with unexpected or heavily changing JSON documents anyway? Just a few lines of code, not very different from Buzz’s parsing code for a polymorphic BSON document:

StringBuffer sb = new StringBuffer();
while (rs.next()) {
    String json = rs.getString(1);
    JsonParser parser = new JsonParser();
    JsonObject object = parser.parse(json).getAsJsonObject();
    for (Map.Entry<String, JsonElement> entry : object.entrySet()) {
        walkMap(sb, entry.getKey(), entry.getValue());
    }
}

System.out.println(sb.toString());

The output it produces is:

a: JsonNumber{value=12}
fn: JsonArray
	0: JsonNumber{value=10}
	1: JsonNumber{value=20}
	2: JsonNumber{value=30}

a: JsonNumber{value=5}
fn: JsonArray
	0: JsonString{value="mix"}
	1: JsonNumber{value=7.0}
	2: JsonString{value="2017-11-22"}
	3: JsonBoolean{value=true}
	4: JsonDocument
		x: JsonNumber{value=3.0}

Source code for the example above

What is noticeable here is that the example above, which produces an output quite similar to that of Buzz’s post, did not require to construct specific BSON constructs, and instead relied on plain, “old” JSON. While we might argue, again, that BSON provides a richer set of datatypes, it is of questionable applicability due to the verboseness of constructing BSON documents and the need to interact with pervasive, existing JSON documents.

Indeed, compare the insert we did on PostgreSQL to get the above output:

insert into foo values (
        '{"props": {"a": 12, "fn": [10,20,30]}, "cd":"2017-11-22T15:20:34.326Z"}'
), (
        '{"props": {"a":5, "fn":["mix", 7.0, "2017-11-22", true, {"x":3.0} ]}}'
);

with the one proposed by Buzz, for MongoDB:

db.foo.insert([ {"props": {"a": NumberInt("12"),
                             "fn": [NumberDecimal("10"),NumberInt("20"),NumberLong("30")] },
                             "cd": new ISODate("2017-11-22T15:20:34.326Z") } },
                   {"props": {"a": NumberInt("5"),
                              "fn": ["mix", 7.0, new ISODate("2017-11-22"), true, {x:3.0}] } }
]);

I’d personally stick with the first one. It is clearly less verbose, and is interoperable JSON, not MongoDB’s proprietary BSON.

Second side order: versioned documents

Having also solved the polymorphic case, I would like to come back to more realistic use cases. What you will probably need to deal with is not arbitrary JSON documents, but rather an evolving schema. Not only adding new, optional fields, but even changing the type of existing ones. Would PostgreSQL and the JSON library be able to cope with that? Short answer: no problem.

The key here is to exploit another advantage of the relational schema: mixing “standard” columns, with fixed data types, with the variable JSON. We may encode all the variability in the JSON, while reserving (at least) one column of the regular table to indicate the version of the accompanying document. For example, by creating the table foo2 as:

create table foo2 (version integer, content jsonb);

With this help from PostgreSQL, it is possible to store versioned “schemas” of the (otherwise variable) JSON documents, like in:

insert into foo2 values (
        1, '{"cd": "2017-11-22T15:20:34.326Z", "props": {"a": 12, "fn": [10, 20, 30]}}'
), (
        2, '{"cd": "2017-11-22T15:20:34.326Z", "props": {"a": "twelve", "fn": [10, 20, 30], "j": false}}'
);

Using Gson’s ability to map a given document to a class, we just need to create a second version of the class that supports the second document shape:

public class JsonContentV2 {
    private String a;
    private int[] fn;
    private boolean j;

    @Override
    public String toString() {
        return "JsonContent{ a=" + a + ", fn=" + Arrays.toString(fn) + ", j=" + j + " }";
    }
}

and parse by instantiating one class or the other with a simple switch statement (or a visitor, or Java 12’s new switch syntax):

Gson gson = new GsonBuilder().create();
while (rs.next()) {
    int version = rs.getInt(1);
    String json = rs.getString(2);

    Object jsonContent = null;
    switch (version) {
        case 1: jsonContent = gson.fromJson(json, JsonContent.class); break;
        case 2: jsonContent = gson.fromJson(json, JsonContentV2.class); break;
    }

    System.out.println(jsonContent);
}

The result:

JsonContent{ a=12, fn=[10, 20, 30] }
JsonContent{ a=twelve, fn=[10, 20, 30], j=false }

Source code for the example above

Dessert: BSON inside PostgreSQL

So far, we have seen how to do strict typed JSON, polymorphic JSON and then versioned JSON to support several different evolutions of the document’s schemas, with JSON schema type inference. We have also seen how JSON has indeed data types, and PostgreSQL also enables to expose them via the jsonb_typeof function.

It is noteworthy that standard JSON types are a subset of BSON’s types. Mongo shell uses the SpiderMonkey JavaScript engine that supports only standard JSON types, the same used by PostgreSQL. To make mongo shell work with BSON format, BSON’s types are wrapped inside objects (like in NumberDecimal("10")). Taking that into account, the same mechanism could be applied in PostgreSQL by wrapping a value inside an object to indicate an extended type like { "a": { "type": "int", "value": "12" } }. With this we address Buzz’s concern that “this still does not answer the fundamental type fidelity problem: fn is an array of what, exactly?”. The whole JSON document would look like:

{ "a": { "type": "int", "value": 12 }, "fn": [ { "type": "decimal", "value": 10 }, { "type": "int", "value": 20 }, { "type": "long", "value": 30 } ] }

This is a bit verbose, but we could wrap all the logic inside PostgreSQL or Java using some helper functions or classes.
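
For instance, a hypothetical helper function (a hedged sketch, not part of any existing library) could build the wrapped representation on the PostgreSQL side:

create or replace function wrap_typed(t text, v jsonb) returns jsonb
    language sql immutable as
$$ select jsonb_build_object('type', t, 'value', v) $$;

select wrap_typed('int', to_jsonb(12));
-- {"type": "int", "value": 12}

select wrap_typed('int', to_jsonb(12))->>'value';
-- 12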

But… wait a minute. If MongoDB does this by wrapping BSON datatypes into standard JSON, PostgreSQL could surely do the same! Certainly, PostgreSQL could use BSON too, by storing it into standard PostgreSQL’s jsonb type! It would look like this:

select content from foo3;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                      content                                                      │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {"cd": {"$date": 1511364034326}, "props": {"a": 12, "fn": [{"$numberDecimal": "10"}, 20, {"$numberLong": "30"}]}} │
│ {"props": {"a": 5, "fn": ["mix", 7.0, {"$date": 1511308800000}, true, {"x": 3.0}]}}                               │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

But how do we query this? Very easily too. We can just add MongoDB’s driver as a dependency to our pom.xml and then use a code practically identical to the one used before, but parsing to BSON:

while (rs.next()) {
    String json = rs.getString(1);

    BsonDocument bson = RawBsonDocument.parse(json);
    for (Map.Entry<String, BsonValue> entry : bson.entrySet()) {
        walkMap(sb, entry.getKey(), entry.getValue());
    }
}

which yields a result identical to what Buzz got with MongoDB:

a: BsonInt32{value=BsonInt32{value=12}}
fn: BsonArray
	0: BsonDecimal128{value=BsonDecimal128{value=10}}
	1: BsonInt32{value=BsonInt32{value=20}}
	2: BsonInt64{value=BsonInt64{value=30}}

a: BsonInt32{value=BsonInt32{value=5}}
fn: BsonArray
	0: BsonString{value=BsonString{value='mix'}}
	1: BsonDouble{value=BsonDouble{value=7.0}}
	2: BsonDateTime{value=BsonDateTime{value=1511308800000}}
	3: BsonBoolean{value=BsonBoolean{value=true}}
	4: BsonDocument
		x: BsonDouble{value=BsonDouble{value=3.0}}

Source code for the example above
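
As a side note, since the wrapped documents are ordinary jsonb, nothing stops us from querying the extended types directly from SQL as well. A hedged sketch against the foo3 table above, converting the BSON-style $date (milliseconds since the epoch) back to a timestamp:

select to_timestamp((content->'cd'->>'$date')::bigint / 1000.0) as cd
from foo3
where content ? 'cd';
-- 2017-11-22 15:20:34.326+00 (displayed according to your session timezone)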

A Patxarán shot

In Spain, it is not infrequent to end a good, long meal with a digestivo (“digestif”). One of my favorite ones, and a very typical Spanish one, is Patxarán. Let’s have a shot of Patxarán!

If you would like to have a Patxarán shot after a delicious Mediterranean meal, while at the same time enjoying great PostgreSQL conversations with PostgreSQL experts and peers, join us at the PostgreSQL Ibiza Conference. OnGres is a proud Cluster Level / Platinum sponsor of the conference.

PostgreSQL has the best of both worlds: strongly typed columns, with support for advanced schema validation, triggers and foreign keys; and loosely typed schemas with columns of JSON (in PostgreSQL land: jsonb) datatype. MongoDB has only got the latter (strictly speaking MongoDB has schema validation; but it is a very poor version of that, where the schema validation is not enforced on old data if validation changes, and does not support foreign keys or triggers).

What’s even more interesting is the discussion that Buzz goes into with regards to productivity. If anything, the more functionality the database provides, the more productive you will be. Otherwise, you end up re-inventing the wheel on the application side. And this is exactly what you need to end up doing with MongoDB in many cases:

  • Joins (despite the $lookup operator, you will end up doing, and even MongoDB encourages, application-side joins).
  • Advanced query capabilities (PostgreSQL’s SQL is arguably much more advanced than MongoDB query language).
  • Transactions. Yes, MongoDB 4.0 has transactions. But they are extremely limited, and only fulfill a very narrow set of use cases. And in the absence (or impossibility) of using transactions, MongoDB defaults to READ UNCOMMITTED isolation level, which imposes heavy taxes on the developer. Follow our Twitter for future announcements about more detailed posts analyzing MongoDB features as compared to PostgreSQL features, including transactions.

All these factors, plus the more robust and trustworthy behavior that PostgreSQL exhibits in production, make PostgreSQL, I truly believe, a much more productive database than MongoDB.

Hope you enjoyed your meal today.

Markus Winand: A Close Look at the Index Include Clause


Some databases—namely Microsoft SQL Server, IBM Db2, and also PostgreSQL since release 11—offer an include clause in the create index statement. The introduction of this feature to PostgreSQL is the trigger for this long overdue explanation of the include clause.

Before going into the details, let’s start with a short recap on how (non-clustered) B-tree indexes work and what the all-mighty index-only scan is.

Recap: B-tree Indexes

To understand the include clause, you must first understand that using an index affects up to three layers of data structures:

  • The B-tree

  • The doubly linked list at the leaf node level of the B-tree

  • The table

The first two structures together form an index so they could be combined into a single item, i.e. the “B-tree index”. I prefer to keep them separate as they serve different needs and have a different impact on performance. Moreover, explaining the include clause requires making this distinction.

In the general case, the database software starts traversing the B-tree to find the first matching entry at the leaf node level (1). It then follows the doubly linked list until it has found all matching entries (2) and finally it fetches each of those matching entries from the table (3). Actually, the last two steps can be interleaved, but that is not relevant for understanding the general concept.

The following formulas give you a rough idea of how many read operations each of these steps needs. The sum of these three components is the total effort of an index access.

  • The B-tree: log100(<rows in table>), often less than 5

  • The doubly linked list: <rows read from index> / 100

  • The table: <rows read from table>

When loading a few rows, the B-tree makes the greatest contribution to the overall effort. As soon as you need to fetch just a handful of rows from the table, this step takes the lead. In either case—few or many rows—the doubly linked list is usually a minor factor because it stores rows with similar values next to each other so that a single read operation can fetch 100 or even more rows. The formula reflects this by the respective divisor.


Note

If you are thinking “That’s why we have clustered indexes”, please read my article: Unreasonable Defaults: Primary Key as Clustering Key.


The most generic idea about optimization is to do less work to achieve the same goal. When it comes to index access, this means that the database software omits accessing a data structure if it doesn’t need any data from it.

You can read more about the inner workings of B-tree indexes in Chapter 1, “Anatomy of an SQL Index”, of SQL Performance Explained.

Recap: Index-Only Scan

The index-only scan does exactly that: it omits the table access if the required data is available in the doubly linked list of the index.

Consider the following index and query I borrowed from Index-Only Scan: Avoiding Table Access.

CREATE INDEX idx
    ON sales
     ( subsidiary_id, eur_value )
SELECT SUM(eur_value)
  FROM sales
 WHERE subsidiary_id = ?

At first glance, you may wonder why the column eur_value is in the index definition at all—it is not mentioned in the where clause.


B-tree Indexes Help Many Clauses

It is a common misconception that indexes only help the where clause.

B-tree indexes can also help the order by, group by, select and other clauses. It is just the B-tree part of an index—not the doubly linked list—that cannot be used by other clauses.


The crucial point in this example is that the B-tree index happens to have all required columns—the database software doesn’t need to access the table itself. This is what we refer to as an index-only scan.

Applying the formulas above, the performance benefit of this is very small if only a few rows satisfy the where clause. On the other hand, if the where clause accepts many rows, e.g. millions, the number of read operations is essentially reduced by a factor of 100.


Note

It is not uncommon that an index-only scan improves performance by one or two orders of magnitude.


The example above uses the fact that the doubly-linked list—the leaf nodes of the B-tree—contains the eur_value column. Although the other nodes of the B-tree store that column too, this query has no use for the information in these nodes.

The Include Clause

The include clause allows us to make a distinction between columns we would like to have in the entire index (key columns) and columns we only need in the leaf nodes (include columns). That means it allows us to remove columns from the non-leaf nodes if we don’t need them there.

Using the include clause, we could refine the index for this query:

CREATE INDEX idx
    ON sales ( subsidiary_id )
     INCLUDE ( eur_value )

The query can still use this index for an index-only scan, thus yielding essentially the same performance.

Besides the obvious differences in the picture, there is also a more subtle difference: the order of the leaf node entries does not take the include columns into account. The index is solely ordered by its key columns. This has two consequences: include columns cannot be used to prevent sorting nor are they considered for uniqueness (see below).
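
To make that concrete, here is a small sketch using the sales table from above: with the include variant of the index, a query like the following still needs an explicit sort step, whereas the original two-column key index could return the rows already ordered by eur_value within each subsidiary.

SELECT eur_value
  FROM sales
 WHERE subsidiary_id = ?
 ORDER BY eur_value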


“Covering Index”

The term “covering index” is sometimes used in the context of index-only scans or include clauses. As this term is often used with a different meaning, I generally avoid it.

What matters is whether a given index can support a given query by means of an index-only scan. Whether or not that index has an include clause or contains all table columns is not relevant.


Compared to the original index definition, the new definition with the include clause has some advantages:

  • The tree might have fewer levels (<~40%)

    As the tree nodes above the doubly linked list do not contain the include columns, the database can store more branches in each block so that the tree might have fewer levels.

  • The index is slightly smaller (<~3%)

    As the non-leaf nodes of the tree don’t contain include columns, the overall size of that index is slightly less. However, the leaf node level of the index needs the most space anyway so that the potential savings in the remaining nodes is very little.

  • It documents its purpose

    This is definitely the most underestimated benefit of the include clause: the reason why the column is in the index is documented in the index definition itself.

Let me elaborate on the last item.

When extending an existing index, it is very important to know exactly why the index is currently defined the way it happens to be defined. The freedom you have to change the index without breaking any other queries is a direct result of this knowledge.

The following query demonstrates this:

SELECT *
  FROM sales
 WHERE subsidiary_id = ?
 ORDER BY ts DESC
 FETCH FIRST 1 ROW ONLY

As before, for a given subsidiary this query fetches the most recent sales entry (ts is for time stamp).

To optimize this query, it would be great to have an index that starts with the key columns (subsidiary_id, ts). With this index, the database software can directly navigate to the latest entry for that subsidiary and return it right away. There is no need to read and sort all of the entries for that subsidiary because the doubly linked list is sorted according to the index key, i.e. the last entry for any given subsidiary must have the greatest ts value for that subsidiary. With this approach, the query is essentially as fast as a primary key lookup. See Indexing Order By and Querying Top-N Rows for more details about this technique.

Before adding a new index for this query, we should check if there is an existing index that can be changed (extended) to support this trick. This is generally a good practice because extending an existing index has a smaller impact on the maintenance overhead than adding a new index. However, when changing an existing index, we need to make sure that we do not make that index less useful for other queries.

If we look at the original index definition, we encounter a problem:

CREATE INDEX idx
    ON sales
     ( subsidiary_id, eur_value )

To make this index support the order by clause of the above query, we would need to insert the ts column between the two existing columns:

CREATE INDEX idx
    ON sales
     ( subsidiary_id, ts, eur_value )

However, that might render this index less useful for queries that need the eur_value column in the second position, e.g. if it was in the where or order by clause. Changing this index involves a considerable risk: breaking other queries unless we know that there are no such queries. If we don’t know, it is often best to keep the index as it is and create another one for the new query.

The picture changes completely if we look at the index with the include clause.

CREATE INDEX idx
    ON sales ( subsidiary_id )
     INCLUDE ( eur_value )

As the eur_value column is in the include clause, it is not in the non-leaf nodes and thus neither useful for navigating the tree nor for ordering. Adding a new column to the end of the key part is relatively safe.

CREATE INDEX idx
    ON sales ( subsidiary_id, ts )
     INCLUDE ( eur_value )

Even though there is still a small risk of negative impacts for other queries, it is usually worth taking that risk.

From the perspective of index evolution, it is thus very helpful to put columns into the include clause if this is all you need. Columns that are just added to enable an index-only scan are the prime candidates for this.

Filtering on Include Columns

Until now we have focused on how the include clause can enable index-only scans. Let’s also look at another case where it is beneficial to have an extra column in the index.

SELECT *
  FROM sales
 WHERE subsidiary_id = ?
   AND notes LIKE '%search term%'

I’ve made the search term a literal value to show the leading and trailing wildcards—of course you would use a bind parameter in your code.

Now, let’s think about the right index for this query. Obviously, the subsidiary_id needs to be in the first position. If we take the previous index from above, it already satisfies this requirement:

CREATE INDEX idx
    ON sales ( subsidiary_id, ts )
     INCLUDE ( eur_value )

The database software can use that index with the three-step procedure as described at the beginning: (1) it will use the B-tree to find the first index entry for the given subsidiary; (2) it will follow the doubly linked list to find all sales for that subsidiary; (3) it will fetch all related sales from the table, remove those for which the like pattern on the notes column doesn’t match and return the remaining rows.

The problem is the last step of this procedure: the table access loads rows without knowing if they will make it into the final result. Quite often, the table access is the biggest contributor to the total effort of running a query. Loading data that is not even selected is a huge performance no-no.


Important

Avoid loading data that doesn’t affect the result of the query.


The challenge with this particular query is that it uses an in-fix like pattern. Normal B-tree indexes don’t support searching such patterns. However, B-tree indexes still support filtering on such patterns. Note the emphasis: searching vs. filtering.

In other words, if the notes column was present in the doubly linked list, the database software could apply the like pattern before fetching that row from the table (not PostgreSQL, see below). This prevents the table access if the like pattern doesn’t match. If the table has more columns, there is still a table access to fetch those columns for the rows that satisfy the where clause—due to the select *.

CREATE INDEX idx
    ON sales ( subsidiary_id, ts )
     INCLUDE ( eur_value, notes )

If there are more columns in the table, the index does not enable an index-only scan. Nonetheless, it can bring the performance close to that of an index-only scan if the portion of rows that match the like pattern is very low. In the opposite case—if all rows match the pattern—the performance is a little bit worse due to the increased index size. However, the breakeven is easy to reach: for an overall performance improvement, it is often enough that the like filter removes a small percentage of the rows. Your mileage will vary depending on the size of the involved columns.

Unique Indexes with Include Clause

Last but not least, there is an entirely different aspect of the include clause: unique indexes with an include clause only consider the key columns for the uniqueness.

That allows us to create unique indexes that have additional columns in the leaf nodes, e.g. for an index-only scan.

CREATE UNIQUE INDEX …
    ON … ( id )
 INCLUDE ( payload )

This index protects against duplicate values in the id column, yet it supports an index-only scan for the next query.

SELECT payload
  FROM …
 WHERE id = ?
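
As a quick sanity check, here is a throwaway sketch (the table t and its values are made up) showing that only the key column counts for uniqueness:

CREATE TABLE t (id bigint, payload text);
CREATE UNIQUE INDEX t_id_idx ON t (id) INCLUDE (payload);

INSERT INTO t VALUES (1, 'a');
INSERT INTO t VALUES (1, 'b');  -- fails: duplicate key value violates the unique index on id
INSERT INTO t VALUES (2, 'a');  -- fine: payload is not part of the uniqueness check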

Note that the include clause is not strictly required for this behavior: databases that make a proper distinction between unique constraints and unique indexes just need an index with the unique key columns as the leftmost columns—additional columns are fine.

For the Oracle Database, the corresponding syntax is this:

CREATE INDEX …
    ON … ( id, payload )
ALTER TABLE … ADD UNIQUE ( id )
      USING INDEX …

Compatibility

[Chart: Availability of INCLUDE across databases]

PostgreSQL: No Filtering Before Visibility Check

The PostgreSQL database has a limitation when it comes to applying filters on the index level. The short story is that it doesn’t do it, except in a few cases. Even worse, some of those cases only work when the respective data is stored in the key part of the index, not in the include clause. That means moving columns to the include clause may negatively affect performance, even if the above described logic still applies.

The long story starts with the fact that PostgreSQL keeps old row versions in the table until they become invisible to all transactions and the vacuum process removes them at some later point in time. To know whether a row version is visible (to a given transaction) or not, each table has two extra attributes that indicate when a row version was created and deleted: xmin and xmax. The row is only visible if the current transaction falls within the xmin/xmax range.

Unfortunately, the xmin/xmax values are not stored in indexes.

That means that whenever PostgreSQL is looking at an index entry, it cannot tell whether or not that entry is visible to the current transaction. It could be a deleted entry or an entry that has not yet been committed. The canonical way to find out is to look into the table and check the xmin/xmax values.

A consequence is that there is no such thing as an index-only scan in PostgreSQL. No matter how many columns you put into an index, PostgreSQL will always need to check the visibility, which is not available in the index.

Yet there is an Index Only Scan operation in PostgreSQL—but that still needs to check the visibility of each row version by accessing data outside the index. Instead of going to the table, the Index Only Scan first checks the so-called visibility map. This visibility map is very dense so the number of read operations is (hopefully) less than fetching xmin/xmax from the table. However, the visibility map does not always give a definite answer: it either states that the row is known to be visible, or that the visibility is not known. In the latter case, the Index Only Scan still needs to fetch xmin/xmax from the table (shown as “Heap Fetches” in explain analyze).
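
As a side note, you can get a feeling for how much of a table is marked all-visible in the visibility map by looking at pg_class (a small sketch, assuming the sales table from above; vacuum keeps these numbers up to date):

SELECT relname, relpages, relallvisible
  FROM pg_class
 WHERE relname = 'sales'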

After this short visibility digression, we can return to filtering on the index level.

SQL allows arbitrarily complex expressions in the where clause. These expressions might also cause runtime errors such as “division by zero”. If PostgreSQL evaluated such an expression before confirming the visibility of the respective entry, even invisible rows could cause such errors. To prevent this, PostgreSQL generally checks the visibility before evaluating such expressions.

There is one exception to this general rule. As the visibility cannot be checked while searching an index, operators that can be used for searching must always be safe to use. These are the operators that are defined in the respective operator class. If a simple comparison filter uses an operation from such an operator class, PostgreSQL can apply that filter before checking the visibility because it knows that these operators are safe to use. The crux is that only key columns have an operator class associated with them. Columns in the include clause don’t—filters based on them are not applied before their visibility is confirmed. This is my understanding from a thread on the PostgreSQL hackers mailing list.

For a demonstration, take the previous index and query:

CREATE INDEX idx
    ON sales ( subsidiary_id, ts )
     INCLUDE ( eur_value, notes )
SELECT *
  FROM sales
 WHERE subsidiary_id = ?
   AND notes LIKE '%search term%'

The execution plan—edited for brevity—could look like this:

               QUERY PLAN
----------------------------------------------
Index Scan using idx on sales (actual rows=16)
  Index Cond: (subsidiary_id = 1)
  Filter: (notes ~~ '%search term%')
  Rows Removed by Filter: 240
  Buffers: shared hit=54

The like filter is shown in Filter, not in Index Cond. That means it was applied at table level. Also, the number of shared hits is rather high for fetching 16 rows.

In a Bitmap Index/Heap Scan the phenomenon becomes more obvious.

                  QUERY PLAN
-----------------------------------------------
Bitmap Heap Scan on sales (actual rows=16)
  Recheck Cond: (subsidiary_id = 1)
  Filter: (notes ~~ '%search term%')
  Rows Removed by Filter: 240
  Heap Blocks: exact=52
  Buffers: shared hit=54
  -> Bitmap Index Scan on idx (actual rows=256)
       Index Cond: (subsidiary_id = 1)
       Buffers: shared hit=2

The Bitmap Index Scan does not mention the like filter at all. Instead it returns 256 rows—way more than the 16 that satisfy the where clause.

Note that this is not a particularity of the include column in this case. Moving the include columns into the index key gives the same result.

CREATE INDEX idx
    ON sales ( subsidiary_id, ts, eur_value, notes)
                  QUERY PLAN
-----------------------------------------------
Bitmap Heap Scan on sales (actual rows=16)
  Recheck Cond: (subsidiary_id = 1)
  Filter: (notes ~~ '%search term%')
  Rows Removed by Filter: 240
  Heap Blocks: exact=52
  Buffers: shared hit=54
  -> Bitmap Index Scan on idx (actual rows=256)
       Index Cond: (subsidiary_id = 1)
       Buffers: shared hit=2

This is because the like operator is not part of the operator class so it is not considered to be safe.

If you use an operation from the operator class, e.g. equals, the execution plan changes.

SELECT *
  FROM sales
 WHERE subsidiary_id = ?
   AND notes = 'search term'

The Bitmap Index Scan now applies all conditions from the where clause and only passes the remaining 16 rows on to the Bitmap Heap Scan.

                 QUERY PLAN
----------------------------------------------
Bitmap Heap Scan on sales (actual rows=16)
  Recheck Cond: (subsidiary_id = 1
             AND notes = 'search term')
  Heap Blocks: exact=16
  Buffers: shared hit=18
  -> Bitmap Index Scan on idx (actual rows=16)
       Index Cond: (subsidiary_id = 1
                AND notes = 'search term')
       Buffers: shared hit=2

Note that this requires the respective column to be a key column. If you move the notes column back to the include clause, it has no associated operator class so the equals operator is not considered safe anymore. Consequently, PostgreSQL postpones applying this filter to the table access until after the visibility is checked.

                 QUERY PLAN
-----------------------------------------------
Bitmap Heap Scan on sales (actual rows=16)
  Recheck Cond: (subsidiary_id = 1)
  Filter: (notes = 'search term')
  Rows Removed by Filter: 240
  Heap Blocks: exact=52
  Buffers: shared hit=54
  -> Bitmap Index Scan on idx (actual rows=256)
       Index Cond: (subsidiary_id = 1)
       Buffers: shared hit=2

“A Close Look at the Index Include Clause” by Markus Winand was originally published at Use The Index, Luke!.


Eric Hanson: "Aquameta Revisited" on FLOSS Weekly


Eric talks about Aquameta 0.2 and the advantages of migrating the web stack into PostgreSQL on This Week in Tech's FLOSS Weekly.

Hubert 'depesz' Lubaczewski: Waiting for PostgreSQL 12 – Add SETTINGS option to EXPLAIN, to print modified settings.

Hubert 'depesz' Lubaczewski: Waiting for PostgreSQL 12 – Allow VACUUM to be run with index cleanup disabled.

Ibrar Ahmed: Benchmark ClickHouse Database and clickhousedb_fdw


In this research, I wanted to see what kind of performance improvements could be gained by using a ClickHouse data source rather than PostgreSQL. Assuming that I would see performance advantages with ClickHouse, would those advantages be retained if I accessed ClickHouse from within PostgreSQL using a foreign data wrapper (FDW)? The FDW in question is clickhousedb_fdw – an open source project from Percona.

The database environments under scrutiny are PostgreSQL v11, clickhousedb_fdw and a ClickHouse database. Ultimately, from within PostgreSQL v11, we are going to issue various SQL queries routed through our clickhousedb_fdw to the ClickHouse database. Then we’ll see how the FDW performance compares with those same queries executed in native PostgreSQL and native ClickHouse.

Clickhouse Database

ClickHouse is an open source, column-oriented database management system that can be between 100 and 1000 times faster than traditional row-oriented approaches, and is capable of processing more than a billion rows in less than a second.

Clickhousedb_fdw

clickhousedb_fdw, a ClickHouse database foreign data wrapper, or FDW, is an open source project from Percona. Here’s a link for the GitHub project repository:

https://github.com/Percona-Lab/clickhousedb_fdw

I wrote a blog in March which tells you more about our FDW: https://www.percona.com/blog/2019/03/29/postgresql-access-clickhouse-one-of-the-fastest-column-dbmss-with-clickhousedb_fdw/

As you’ll see, this provides an FDW for ClickHouse that allows you to SELECT from, and INSERT INTO, a ClickHouse database from within a PostgreSQL v11 server.

The FDW supports advanced features such as aggregate pushdown and join pushdown. These significantly improve performance by utilizing the remote server’s resources for these resource-intensive operations.
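
For orientation, the foreign table fontime that appears in the queries and plans below is declared along these lines (a sketch only: the column list is shortened and the connection OPTIONS, which are specific to clickhousedb_fdw, are omitted here; see the project README for the real setup):

CREATE EXTENSION clickhousedb_fdw;

CREATE SERVER clickhouse_svr FOREIGN DATA WRAPPER clickhousedb_fdw;  -- connection OPTIONS omitted in this sketch

CREATE USER MAPPING FOR CURRENT_USER SERVER clickhouse_svr;

CREATE FOREIGN TABLE fontime (
    "Year"     integer,
    "DepDelay" integer
    -- ... the remaining ontime columns ...
) SERVER clickhouse_svr;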

Benchmark environment

  • Supermicro server:
    • Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz
    • 2 sockets / 28 cores / 56 threads
    • Memory: 256GB of RAM
    • Storage: Samsung  SM863 1.9TB Enterprise SSD
    • Filesystem: ext4/xfs
  • OS: Linux smblade01 4.15.0-42-generic #45~16.04.1-Ubuntu
  • PostgreSQL: version 11

Benchmark tests

Rather than using some machine generated dataset for this benchmark, we used the “On Time Reporting Carrier On-Time Performance” data from 1987 to 2018. You can access the data using our shell script available here:

https://github.com/Percona-Lab/ontime-airline-performance/blob/master/download.sh

The size of the database is 85GB, providing a single table of 109 columns.

Benchmark Queries

Here are the queries I used to benchmark ClickHouse, clickhousedb_fdw, and PostgreSQL.

Queries containing aggregates and GROUP BY:

Q1:  SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
Q2:  SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
Q3:  SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10;
Q4:  SELECT Carrier, count(*) FROM ontime WHERE DepDelay>10 AND Year = 2007 GROUP BY Carrier ORDER BY count(*) DESC;
Q5:  SELECT a.Carrier, c, c2, c*1000/c2 as c3 FROM ( SELECT Carrier, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year=2007 GROUP BY Carrier ) a INNER JOIN ( SELECT Carrier, count(*) AS c2 FROM ontime WHERE Year=2007 GROUP BY Carrier ) b on a.Carrier=b.Carrier ORDER BY c3 DESC;
Q6:  SELECT a.Carrier, c, c2, c*1000/c2 as c3 FROM ( SELECT Carrier, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY Carrier ) a INNER JOIN ( SELECT Carrier, count(*) AS c2 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier ) b on a.Carrier=b.Carrier ORDER BY c3 DESC;
Q7:  SELECT Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier;
Q8:  SELECT Year, avg(DepDelay) FROM ontime GROUP BY Year;
Q9:  select Year, count(*) as c1 from ontime group by Year;
Q10: SELECT avg(cnt) FROM (SELECT Year,Month,count(*) AS cnt FROM ontime WHERE DepDel15=1 GROUP BY Year,Month) a;
Q11: select avg(c1) from (select Year,Month,count(*) as c1 from ontime group by Year,Month) a;
Q12: SELECT OriginCityName, DestCityName, count(*) AS c FROM ontime GROUP BY OriginCityName, DestCityName ORDER BY c DESC LIMIT 10;
Q13: SELECT OriginCityName, count(*) AS c FROM ontime GROUP BY OriginCityName ORDER BY c DESC LIMIT 10;

Queries containing joins:

Q14: SELECT a.Year, c1/c2 FROM ( select Year, count(*)*1000 as c1 from ontime WHERE DepDelay>10 GROUP BY Year ) a INNER JOIN ( select Year, count(*) as c2 from ontime GROUP BY Year ) b on a.Year=b.Year ORDER BY a.Year;
Q15: SELECT a."Year", c1/c2 FROM ( select "Year", count(*)*1000 as c1 FROM fontime WHERE "DepDelay">10 GROUP BY "Year" ) a INNER JOIN ( select "Year", count(*) as c2 FROM fontime GROUP BY "Year" ) b on a."Year"=b."Year";

Table-1: Queries used in the benchmark

Query executions

Here are the results from each of the queries when run against the different database setups: PostgreSQL with and without indexes, native ClickHouse, and clickhousedb_fdw. The time is shown in milliseconds.

Q#  | PostgreSQL | PostgreSQL (Indexed) | ClickHouse | clickhousedb_fdw
----+------------+----------------------+------------+-----------------
Q1  |      27920 |                19634 |         23 |              57
Q2  |      35124 |                17301 |         50 |              80
Q3  |      34046 |                15618 |         67 |             115
Q4  |      31632 |                 7667 |         25 |              37
Q5  |      47220 |                 8976 |         27 |              60
Q6  |      58233 |                24368 |         55 |             153
Q7  |      30566 |                13256 |         52 |              91
Q8  |      38309 |                60511 |        112 |             179
Q9  |      20674 |                37979 |         31 |              81
Q10 |      34990 |                20102 |         56 |             148
Q11 |      30489 |                51658 |         37 |             155
Q12 |      39357 |                33742 |        186 |            1333
Q13 |      29912 |                30709 |        101 |             384
Q14 |      54126 |                39913 |        124 |         1364212
Q15 |      97258 |                30211 |        245 |             259

Table-2: Time taken (in milliseconds) to execute the queries used in the benchmark

Reviewing the results

The graph shows the query execution time: the X-axis shows the query number from the tables above, while the Y-axis shows the execution time in milliseconds. Shown are the results for ClickHouse and for the data accessed from PostgreSQL using clickhousedb_fdw. From the table, you can see there is a huge difference between PostgreSQL and ClickHouse, but a minimal difference between ClickHouse and clickhousedb_fdw.

 

Clickhouse Vs Clickhousedb_fdw (Shows the overhead of clickhousedb_fdw)

This graph shows the difference between ClickHouse and clickhousedb_fdw. In most of the queries, the FDW overhead is not that great, and is barely significant apart from Q12. This query involves joins and an ORDER BY clause. Because of the ORDER BY clause, the GROUP BY and ORDER BY are not pushed down to ClickHouse.

In Table-2 we can see the spike in time for queries Q12 and Q13. To reiterate, this is caused by the ORDER BY clause. To confirm this, I ran queries Q14 and Q15 with and without the ORDER BY clause. Without the ORDER BY clause the completion time is 259 ms, and with the ORDER BY clause it is 1364212 ms. To debug it, I ran EXPLAIN on both queries; here are the results.

bm=# EXPLAIN VERBOSE SELECT a."Year", c1/c2
     FROM (SELECT "Year", count(*)*1000 AS c1 FROM fontime WHERE "DepDelay" > 10 GROUP BY "Year") a
     INNER JOIN(SELECT "Year", count(*) AS c2 FROM fontime GROUP BY "Year") b ON a."Year"=b."Year";

 

                                                    QUERY PLAN
Hash Join  (cost=2250.00..128516.06 rows=50000000 width=12)
  Output: fontime."Year", (((count(*) * 1000)) / b.c2)
  Inner Unique: true
  Hash Cond: (fontime."Year" = b."Year")
  ->  Foreign Scan  (cost=1.00..-1.00 rows=100000 width=12)
        Output: fontime."Year", ((count(*) * 1000))
        Relations: Aggregate on (fontime)
        Remote SQL: SELECT "Year", (count(*) * 1000) FROM "default".ontime WHERE (("DepDelay" > 10)) GROUP BY "Year"
  ->  Hash  (cost=999.00..999.00 rows=100000 width=12)
        Output: b.c2, b."Year"
        ->  Subquery Scan on b  (cost=1.00..999.00 rows=100000 width=12)
              Output: b.c2, b."Year"
              ->  Foreign Scan  (cost=1.00..-1.00 rows=100000 width=12)
                    Output: fontime_1."Year", (count(*))
                    Relations: Aggregate on (fontime)
                    Remote SQL: SELECT "Year", count(*) FROM "default".ontime GROUP BY "Year"
(16 rows)

 

bm=# EXPLAIN VERBOSE SELECT a."Year", c1/c2 FROM(SELECT "Year", count(*)*1000 AS c1 FROM fontime WHERE "DepDelay" > 10 GROUP BY "Year") a
     INNER JOIN(SELECT "Year", count(*) as c2 FROM fontime GROUP BY "Year") b  ON a."Year"= b."Year"
     ORDER BY a."Year";

 

                                                          QUERY PLAN
Merge Join  (cost=2.00..628498.02 rows=50000000 width=12)
  Output: fontime."Year", (((count(*) * 1000)) / (count(*)))
  Inner Unique: true
  Merge Cond: (fontime."Year" = fontime_1."Year")
  ->  GroupAggregate  (cost=1.00..499.01 rows=1 width=12)
        Output: fontime."Year", (count(*) * 1000)
        Group Key: fontime."Year"
        ->  Foreign Scan on public.fontime  (cost=1.00..-1.00 rows=100000 width=4)
              Remote SQL: SELECT "Year" FROM "default".ontime WHERE (("DepDelay" > 10)) ORDER BY "Year" ASC
  ->  GroupAggregate  (cost=1.00..499.01 rows=1 width=12)
        Output: fontime_1."Year", count(*)
        Group Key: fontime_1."Year"
        ->  Foreign Scan on public.fontime fontime_1  (cost=1.00..-1.00 rows=100000 width=4)
              Remote SQL: SELECT "Year" FROM "default".ontime ORDER BY "Year" ASC
(16 rows)

Conclusion

The results from these experiments show that ClickHouse offers really good performance, and clickhousedb_fdw offers the benefits of ClickHouse performance from within PostgreSQL. While there is some overhead when using clickhousedb_fdw, it is negligible and is comparable to the performance achieved when running natively within the ClickHouse database. This also confirms that the PostgreSQL foreign data wrapper push-down feature provides wonderful results.

Pavel Stehule: pspg on Solaris

I fixed some issues, and pspg can now be used on Solaris too. I found some issues on the Solaris side with UTF-8 support, but they affect only a subset of characters. Due to these issues, don't use psql's unicode borders.

Pavel Golub: 1-to-1 relationship in PostgreSQL for real


Years ago

Years ago I wrote a post describing how to implement a 1-to-1 relationship in PostgreSQL. The trick was simple and obvious:

CREATE TABLE UserProfiles (
        UProfileID BIGSERIAL PRIMARY KEY,
...
);

CREATE TABLE Users (
        UID BIGSERIAL PRIMARY KEY,
        UProfileID int8 NOT NULL,
...
        UNIQUE(UProfileID),
        FOREIGN KEY(UProfileID) REFERENCES UserProfiles(UProfileID)
);

You put a unique constraint on a referenced column and you’re fine. But then one of the readers noticed, that this is the 1-to-(0..1) relationship, not a true 1-to-1. And he was absolutely correct.

Keep it simple stupid!

A lot of time has passed, and now we can do this trick much more simply using modern features of PostgreSQL. Let’s check:

BEGIN;

CREATE TABLE uProfiles (
        uid int8 PRIMARY KEY,
        payload jsonb NOT NULL
);

CREATE TABLE Users (
        uid int8 PRIMARY KEY,
        uname text NOT NULL,
        FOREIGN KEY (uid) REFERENCES uProfiles (uid)
);

ALTER TABLE uProfiles 
	ADD FOREIGN KEY (uid) REFERENCES Users (uid);

INSERT INTO Users VALUES (1, 'Pavlo Golub');

INSERT INTO uProfiles VALUES (1, '{}');

COMMIT;

Things are obvious. We create two tables that reference each other using the same columns in both directions.
Moreover, in such a model both our foreign keys are automatically indexed!
Seems legit, but executing this script will produce an error:

SQL Error [23503]: ERROR: insert or update on table "users" 
   violates foreign key constraint "users_uid_fkey"
Detail: Key (uid)=(1) is not present in table "uprofiles".

Oops. And that was the pitfall that prevented this easy solution years ago, at the time of my first post.

What about now?

But now we have DEFERRABLE constraints:

This controls whether the constraint can be deferred. A constraint that is not deferrable will be checked immediately after every command. Checking of constraints that are deferrable can be postponed until the end of the transaction (using the SET CONSTRAINTS command). NOT DEFERRABLE is the default. Currently, only UNIQUE, PRIMARY KEY, EXCLUDE, and REFERENCES (foreign key) constraints accept this clause. NOT NULL and CHECK constraints are not deferrable. Note that deferrable constraints cannot be used as conflict arbitrators in an INSERT statement that includes an ON CONFLICT DO UPDATE clause.

So the trick is we do not check data consistency till the end of the transaction. Let’s try!

BEGIN;

CREATE TABLE uProfiles (
        uid int8 NOT NULL PRIMARY KEY,
        payload jsonb NOT NULL
);

CREATE TABLE Users (
        uid int8 NOT NULL PRIMARY KEY,
        uname text NOT NULL
);

ALTER TABLE Users
        ADD FOREIGN KEY (uid) REFERENCES uProfiles (uid)
                DEFERRABLE INITIALLY DEFERRED;

ALTER TABLE uProfiles 
        ADD FOREIGN KEY (uid) REFERENCES Users (uid)
                DEFERRABLE INITIALLY DEFERRED;

INSERT INTO Users VALUES (1, 'Pavlo Golub');

INSERT INTO uProfiles VALUES (1, '{}');

COMMIT;

Neat! Works like a charm!

SELECT * FROM Users, uProfiles;
uid|uname      |uid|payload|
---|-----------|---|-------|
  1|Pavlo Golub|  1|{}     |
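
As a quick check that the constraints really are enforced (same schema as above, the values are made up), inserting a user without a matching profile now fails when the deferred constraint is checked at COMMIT:

BEGIN;

INSERT INTO Users VALUES (2, 'Alice');

COMMIT;
-- ERROR: insert or update on table "users"
--    violates foreign key constraint "users_uid_fkey"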

Conclusion

I am still eager to see a real-life situation where such a 1-to-1 model is necessary. From my perspective, this method may help in splitting wide tables into several narrow ones, where some columns are heavily read. If you have any other thoughts on your mind, shoot them up!

May ACID be with you!

The post 1-to-1 relationship in PostgreSQL for real appeared first on Cybertec.

Ernst-Georg Schmid: Not all CASTs are created equal?

Can somebody explain this?

PostgreSQL 11.2.

The documentation says:

"A type cast specifies a conversion from one data type to another. PostgreSQL accepts two equivalent syntaxes for type casts:

CAST ( expression AS type )

expression::type

The CAST syntax conforms to SQL; the syntax with :: is historical PostgreSQL usage."

But when I test the lower limits of PostgreSQL's integer types, strange things happen.

select cast(-9223372036854775808 as bigint);
select cast(-2147483648 as integer);
select cast(-32768 as smallint);

All OK.

select -9223372036854775808::bigint;
select -2147483648::integer;
select -32768::smallint;

All fail with SQL Error [22003]: ERROR: out of range

But:

select -9223372036854775807::bigint;
select -2147483647::integer;
select -32767::smallint;

All OK.

???
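
For what it is worth, a likely explanation is operator precedence: the :: cast binds more tightly than the unary minus, so -32768::smallint is parsed as -(32768::smallint), and 32768 on its own is out of range for smallint. With explicit parentheses all three lower limits work again:

select (-32768)::smallint;
select (-2147483648)::integer;
select (-9223372036854775808)::bigint;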

Sebastian Insausti: How to Use pgBackRest to Backup PostgreSQL and TimescaleDB


Your data is probably the most valuable asset in the company, so you should have a Disaster Recovery Plan (DRP) to prevent data loss in the event of an accident or hardware failure. A backup is the simplest form of DR. It might not always be enough to guarantee an acceptable Recovery Point Objective (RPO), but it is a good first approach. You should also define a Recovery Time Objective (RTO) according to your company requirements; there are many ways to reach the RTO value, depending on the company's goals.

In this blog, we’ll see how to use pgBackRest for backing up PostgreSQL and TimescaleDB and how to use one of the most important features of this backup tool, the combination of Full, Incremental and Differential backups, to minimize downtime.

What is pgBackRest?

There are different types of backups for databases:

  • Logical: The backup is stored in a human-readable format like SQL.
  • Physical: The backup contains binary data.
  • Full/Incremental/Differential: The definition of these three types of backups is implicit in the name. The full backup is a full copy of all your data. Incremental backup only backs up the data that has changed since the previous backup and the differential backup only contains the data that has changed since the last full backup executed. The incremental and differential backups were introduced as a way to decrease the amount of time and disk space usage that it takes to perform a full backup.

pgBackRest is an open source backup tool that creates physical backups with some improvements compared to the classic pg_basebackup tool. We can use pgBackRest to perform an initial database copy for Streaming Replication by using an existing backup, or we can use the delta option to rebuild an old standby server.

Some of the most important pgBackRest features are:

  • Parallel Backup & Restore
  • Local or Remote Operation
  • Full, Incremental and Differential Backups
  • Backup Rotation and Archive Expiration
  • Backup Integrity check
  • Backup Resume
  • Delta Restore
  • Encryption

Now, let’s see how we can use pgBackRest to backup our PostgreSQL and TimescaleDB databases.

How to Use pgBackRest

For this test, we’ll use CentOS 7 as OS and PostgreSQL 11 as the database server. We’ll assume you have the database installed; if not, you can follow these links to deploy either PostgreSQL or TimescaleDB in an easy way by using ClusterControl.

First, we need to install the pgbackrest package.

$ yum install pgbackrest

pgBackRest can be used from the command line, or from a configuration file located by default in /etc/pgbackrest.conf on CentOS7. This file contains the following lines:

[global]
repo1-path=/var/lib/pgbackrest
#[main]
#pg1-path=/var/lib/pgsql/10/data

You can check this link to see which parameters we can add to this configuration file.

We’ll add the following lines:

[testing]
pg1-path=/var/lib/pgsql/11/data

Make sure that you have the following configuration added in the postgresql.conf file (these changes require a service restart).

archive_mode = on
archive_command = 'pgbackrest --stanza=testing archive-push %p'
max_wal_senders = 3
wal_level = logical
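
After the restart, one way to double-check that WAL archiving is actually working is from within psql (a small sketch; pg_switch_wal() just forces a segment switch so that there is something to archive):

SELECT pg_switch_wal();

SELECT archived_count, last_archived_wal, failed_count
  FROM pg_stat_archiver;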

Now, let’s take a basic backup. First, we need to create a “stanza”, which defines the backup configuration for a specific PostgreSQL or TimescaleDB database cluster. The stanza section must define the database cluster path and host/user if the database cluster is remote.

$ pgbackrest --stanza=testing --log-level-console=info stanza-create
2019-04-29 21:46:36.922 P00   INFO: stanza-create command begin 2.13: --log-level-console=info --pg1-path=/var/lib/pgsql/11/data --repo1-path=/var/lib/pgbackrest --stanza=testing
2019-04-29 21:46:37.475 P00   INFO: stanza-create command end: completed successfully (554ms)

And then, we can run the check command to validate the configuration.

$ pgbackrest --stanza=testing --log-level-console=info check
2019-04-29 21:51:09.893 P00   INFO: check command begin 2.13: --log-level-console=info --pg1-path=/var/lib/pgsql/11/data --repo1-path=/var/lib/pgbackrest --stanza=testing
2019-04-29 21:51:12.090 P00   INFO: WAL segment 000000010000000000000001 successfully stored in the archive at '/var/lib/pgbackrest/archive/testing/11-1/0000000100000000/000000010000000000000001-f29875cffe780f9e9d9debeb0b44d945a5165409.gz'
2019-04-29 21:51:12.090 P00   INFO: check command end: completed successfully (2197ms)

To take the backup, run the following command:

$ pgbackrest --stanza=testing --type=full --log-level-stderr=info backup
INFO: backup command begin 2.13: --log-level-stderr=info --pg1-path=/var/lib/pgsql/11/data --repo1-path=/var/lib/pgbackrest --stanza=testing --type=full
WARN: option repo1-retention-full is not set, the repository may run out of space
      HINT: to retain full backups indefinitely (without warning), set option 'repo1-retention-full' to the maximum.
INFO: execute non-exclusive pg_start_backup() with label "pgBackRest backup started at 2019-04-30 15:43:21": backup begins after the next regular checkpoint completes
INFO: backup start archive = 000000010000000000000006, lsn = 0/6000028
WARN: aborted backup 20190429-215508F of same type exists, will be cleaned to remove invalid files and resumed
INFO: backup file /var/lib/pgsql/11/data/base/16384/1255 (608KB, 1%) checksum e560330eb5300f7e2bcf8260f37f36660ce3a2c1
INFO: backup file /var/lib/pgsql/11/data/base/13878/1255 (608KB, 3%) checksum e560330eb5300f7e2bcf8260f37f36660ce3a2c1
INFO: backup file /var/lib/pgsql/11/data/base/13877/1255 (608KB, 5%) checksum e560330eb5300f7e2bcf8260f37f36660ce3a2c1
. . .
INFO: full backup size = 31.8MB
INFO: execute non-exclusive pg_stop_backup() and wait for all WAL segments to archive
INFO: backup stop archive = 000000010000000000000006, lsn = 0/6000130
INFO: new backup label = 20190429-215508F
INFO: backup command end: completed successfully (12810ms)
INFO: expire command begin
INFO: option 'repo1-retention-archive' is not set - archive logs will not be expired
INFO: expire command end: completed successfully (10ms)

Now that the backup has finished with the “completed successfully” output, let’s restore it. We’ll stop the postgresql-11 service.

$ service postgresql-11 stop
Redirecting to /bin/systemctl stop postgresql-11.service

And leave the datadir empty.

$ rm -rf /var/lib/pgsql/11/data/*

Now, run the following command:

$ pgbackrest --stanza=testing --log-level-stderr=info restore
INFO: restore command begin 2.13: --log-level-stderr=info --pg1-path=/var/lib/pgsql/11/data --repo1-path=/var/lib/pgbackrest --stanza=testing
INFO: restore backup set 20190429-215508F
INFO: restore file /var/lib/pgsql/11/data/base/16384/1255 (608KB, 1%) checksum e560330eb5300f7e2bcf8260f37f36660ce3a2c1
INFO: restore file /var/lib/pgsql/11/data/base/13878/1255 (608KB, 3%) checksum e560330eb5300f7e2bcf8260f37f36660ce3a2c1
INFO: restore file /var/lib/pgsql/11/data/base/13877/1255 (608KB, 5%) checksum e560330eb5300f7e2bcf8260f37f36660ce3a2c1
. . .
INFO: write /var/lib/pgsql/11/data/recovery.conf
INFO: restore global/pg_control (performed last to ensure aborted restores cannot be started)
INFO: restore command end: completed successfully (10819ms)

Then, start the postgresql-11 service.

$ service postgresql-11 start

And now we have our database up and running.

$ psql -U app_user world
world=> select * from city limit 5;
 id |      name      | countrycode |   district    | population
----+----------------+-------------+---------------+------------
  1 | Kabul          | AFG         | Kabol         |    1780000
  2 | Qandahar       | AFG         | Qandahar      |     237500
  3 | Herat          | AFG         | Herat         |     186800
  4 | Mazar-e-Sharif | AFG         | Balkh         |     127800
  5 | Amsterdam      | NLD         | Noord-Holland |     731200
(5 rows)

Now, let’s see how we can take a differential backup.

$ pgbackrest --stanza=testing --type=diff --log-level-stderr=info backup
INFO: backup command begin 2.13: --log-level-stderr=info --pg1-path=/var/lib/pgsql/11/data --repo1-path=/var/lib/pgbackrest --stanza=testing --type=diff
WARN: option repo1-retention-full is not set, the repository may run out of space
      HINT: to retain full backups indefinitely (without warning), set option 'repo1-retention-full' to the maximum.
INFO: last backup label = 20190429-215508F, version = 2.13
INFO: execute non-exclusive pg_start_backup() with label "pgBackRest backup started at 2019-04-30 21:22:58": backup begins after the next regular checkpoint completes
INFO: backup start archive = 00000002000000000000000B, lsn = 0/B000028
WARN: a timeline switch has occurred since the last backup, enabling delta checksum
INFO: backup file /var/lib/pgsql/11/data/base/16429/1255 (608KB, 1%) checksum e560330eb5300f7e2bcf8260f37f36660ce3a2c1
INFO: backup file /var/lib/pgsql/11/data/base/16429/2608 (448KB, 8%) checksum 53bd7995dc4d29226b1ad645995405e0a96a4a7b
. . .
INFO: diff backup size = 40.1MB
INFO: execute non-exclusive pg_stop_backup() and wait for all WAL segments to archive
INFO: backup stop archive = 00000002000000000000000B, lsn = 0/B000130
INFO: new backup label = 20190429-215508F_20190430-212258D
INFO: backup command end: completed successfully (23982ms)
INFO: expire command begin
INFO: option 'repo1-retention-archive' is not set - archive logs will not be expired
INFO: expire command end: completed successfully (14ms)

For more complex backups you can follow the pgBackRest user guide.

As we mentioned earlier, you can use the command line or the configuration files to manage your backups.


How to Use pgBackRest in ClusterControl

Since 1.7.2 version, ClusterControl added support for pgBackRest for backing up PostgreSQL and TimescaleDB databases, so let’s see how we can use it from ClusterControl.

Creating a Backup

For this task, go to ClusterControl -> Select Cluster -> Backup -> Create Backup.

We can create a new backup or configure a scheduled one. For our example, we will create a single backup instantly.

We must choose one method, the server from which the backup will be taken, and where we want to store the backup. We can also upload our backup to the cloud (AWS, Google or Azure) by enabling the corresponding button.

In this case, we’ll choose the pgbackrestfull method to take an initial full backup. When selecting this option, we’ll see the following red note:

“During first attempt of making pgBackRest backup, ClusterControl will re-configure the node (deploys and configures pgBackRest) and after that the db node needs to be restarted first.”

So, please, take it into account for the first backup attempt.

Then we specify the use of compression and the compression level for our backup.

On the backup section, we can see the progress of the backup, and information like the method, size, location, and more.

The steps are the same to create a differential or incremental backup. We only need to choose the desired method during the backup creation.

Restoring a Backup

Once the backup is finished, we can restore it by using ClusterControl. For this, in our backup section (ClusterControl -> Select Cluster -> Backup), we can select "Restore Backup", or directly "Restore" on the backup that we want to restore.

We have three options to restore the backup. We can restore the backup in an existing database node, restore and verify the backup on a standalone host or create a new cluster from the backup.

If we choose the Restore on Node option, we must specify the Master node, because it’s the only one writable in the cluster.

We can monitor the progress of our restore from the Activity section in our ClusterControl.

Automatic Backup Verification

A backup is not a backup if it's not restorable. Verifying backups is something that is usually neglected by many. Let’s see how ClusterControl can automate the verification of PostgreSQL and TimescaleDB backups and help avoid any surprises.

In ClusterControl, select your cluster and go to the "Backup" section, then, select “Create Backup”.

The automatic verify backup feature is available for the scheduled backups. So, let’s choose the “Schedule Backup” option.

When scheduling a backup, in addition to selecting the common options like method or storage, we also need to specify schedule/frequency.

In the next step, we can compress our backup and enable the “Verify Backup” feature.

To use this feature, we need a dedicated host (or VM) that is not part of the cluster.

ClusterControl will install the software and it’ll restore the backup in this host. After restoring, we can see the verification icon in the ClusterControl Backup section.

Recommendations

There are also some tips that we can take into account when creating our backups:

  • Store the backup on a remote location: We shouldn’t store the backup on the database server. In case of server failure, we could lose the database and the backup at the same time.
  • Keep a copy of the latest backup on the database server: This could be useful for faster recovery.
  • Use incremental/differential backups: To reduce the backup recovery time and disk space usage.
  • Backup the WALs: If we need to restore a database from the last backup and we only restore that backup, we’ll lose all the changes made between the time the backup was taken and the restore time; but if we have the WALs we can apply those changes and use PITR.
  • Use both Logical and Physical backups: Both are necessary for different reasons. For example, if we want to restore only one database/table, we don’t need the physical backup; we only need the logical backup, and it’ll be even faster than restoring the entire server.
  • Take backups from standby nodes (if it’s possible): To avoid extra load on the primary node, it’s a good practice to take the backup from the standby server.
  • Test your backups: The confirmation that the backup is done is not enough to ensure the backup is working. We should restore it on a standalone server and test it to avoid a surprise in case of failure.

Conclusion

As we could see, pgBackRest is a good option to improve our backup strategy. It helps you protect your data and it could be useful to reach the RTO by reducing the downtime in case of failure. Incremental backups can help reduce the amount of time and storage space used for the backup process. ClusterControl can help automate the backup process for your PostgreSQL and TimescaleDB databases and, in case of failure, restore it with a few clicks.

Craig Kerstiens: Introducing Hyperscale (Citus) on Azure Database for PostgreSQL


For roughly ten years now, I’ve had the pleasure of running and managing databases for people. In the early stages of building an application you move quickly, adding new tables and columns to your Postgres database to support new functionality. You move quickly, but you don’t worry too much because things are fast and responsive–largely because your data is small. Over time your application grows and matures. Your data model stabilizes, and you start to spend more time tuning and tweaking to ensure performance and stability stay where they need to. Eventually you get to the point where you miss the days of maintaining a small database, because life was easier then. Indexes were created quickly, joins were fast, count(*) didn’t bring your database to a screeching halt, and vacuum was not a regular part of your lunchtime conversation. As you continue to tweak and optimize the system, you know you need a plan for the future and know how you’re going to continue to scale.

Now in Preview: Introducing Hyperscale (Citus) on Azure Database for PostgreSQL

With Hyperscale (Citus) on Azure Database for PostgreSQL, we help many of those worries fade away. I am super excited to announce that Citus is now available on Microsoft Azure, as a new deployment option on the Azure Database for PostgreSQL called Hyperscale (Citus).

Hyperscale (Citus) scales out your data across multiple physical nodes, with the underlying data being sharded into much smaller bits. The same database sharding principles that work for Facebook and Google are baked right into the database. But, unlike traditional sharded systems, your application doesn’t have to learn how to shard the data. With Azure Database on PostgreSQL, Hyperscale (Citus) takes Postgres, the open source relational database, and extends it with low level internal hooks.

This means you can go back to building new features and functionality, without having to deal with a massive database that is a pain to maintain. When you provision a Hyperscale (Citus) server group, you’ll have a coordinator node, which is responsible for distributed planning, query routing and aggregation, as well as a number of worker nodes, which you specify. The database cluster will come with the Citus extension already set up and preconfigured so you can begin adding data right away. Let’s dig in and get hands on to see how we can begin scaling and quit worrying about our database with Hyperscale (Citus).

Now with my new Hyperscale (Citus) server group (what I often call a database cluster), I can connect directly to a coordinator. My Hyperscale (Citus) coordinator is responsible for distributed query planning, routing, and aggregation at times. You can think of it mostly as air traffic control, directing the planes which are doing all the heavy lifting. Once I connect to my Citus cluster I can create some standard Postgres tables. In this case I’m going to use some data you may be well used to working with from the other side: GitHub. GitHub thankfully makes much of their data publicly available.

Getting started with Hyperscale (Citus) on Azure Database for PostgreSQL

To start with I’m going to connect with psql or Azure Data Studio and create my tables:

CREATE TABLE github_events (
    event_id     bigint,
    event_type   text,
    event_public boolean,
    repo_id      bigint,
    payload      jsonb,
    repo         jsonb,
    user_id      bigint,
    org          jsonb,
    created_at   timestamp
);

CREATE TABLE github_users (
    user_id       bigint,
    url           text,
    login         text,
    avatar_url    text,
    gravatar_id   text,
    display_login text
);

With Hyperscale (Citus), our cluster has some new capabilities that allow us to shard our data. By default when we run the command to distribute our data it will create 32 shards (this is configurable) and spread them across the nodes in your Postgres cluster:

SELECT create_distributed_table('github_events', 'user_id');
SELECT create_distributed_table('github_users', 'user_id');

Because, under the covers, each of my tables is sharded into something that looks roughly like github_events_001, github_events_002, github_events_003, and so on, each of those shard tables is significantly smaller. This means all those operations that were intensive and stress inducing on a large database are worry free.
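
If you are curious, the shard metadata the coordinator maintains can be inspected directly (a sketch; pg_dist_shard is part of the Citus extension, and with the defaults you should see 32 shards per distributed table):

SELECT logicalrelid, count(*) AS shard_count
  FROM pg_dist_shard
 GROUP BY logicalrelid;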

Now let’s load up some data and run some queries. To start with we’re going to use the bulk load utility copy to load our data:

\copy github_events from events.csv CSV;
\copy github_users from users.csv CSV;

We just loaded a little over 1,000,000 events in right at 30 seconds. For comparison's sake I did the same on a single node Postgres setup, went and made myself some coffee, came back, drank my coffee, and waited for my load to finish. In reality my load on a single node server finished in 4 minutes, but an 8x improvement is nice–especially knowing I can add more nodes to make things faster. This parallelization thing is kinda fun. Querying my data is the same as it always has been: execute standard SQL. Except now Hyperscale (Citus) takes care of parallelizing my queries, managing distributed transactions, and more.

select count(*) from github_events;
  count
---------
 1009960
(1 row)

Time: 29.848 ms

In addition to parallelizing the load, Hyperscale (Citus) can parallelize queries, index creation, autovacuum… in general, just about all the operations that tend to keep you awake at night when it comes to taking care of your database. If you’re curious to try Hyperscale (Citus) on Azure Database for PostgreSQL, create your Azure account and give some of our tutorials a try.

Andreas Scherbaum: Google Summer of Code 2019 - PostgreSQL participates with 5 projects

Author: Andreas 'ads' Scherbaum

For the 13th year, the PostgreSQL Project is participating in Google Summer of Code (GSoC). This program is a great opportunity to let students learn about Open Source projects, and helps them deliver new features. It is also a chance to engage the students beyond just one summer, and grow them into active contributors.

In GSoC, students first learn about the Open Source organization, and either pick a summer project from the list provided by the org, or submit their own idea for review. After a “community bonding” period, the students have time to implement their idea, under supervision of mentors from the Open Source organization. There is also an incentive: first, Google pays the students for their work on improving Open Source projects. And second, having a completed GSoC project in a CV is well recognized.

Continue reading "Google Summer of Code 2019 - PostgreSQL participates with 5 projects"

Hans-Juergen Schoenig: PostgreSQL: Using CREATE USER with caution


PostgreSQL offers powerful means to manage users / roles and enables administrators to implement everything from simple to really complex security concepts. However, if the PostgreSQL security machinery is not used wisely, things might become a bit rough.

This fairly short post will try to shed some light on this topic.

The golden rule: Distinguish between users and roles

The most important thing you have to remember is the following: you cannot drop a user as long as permissions, objects, policies, tablespaces, and so on are still assigned to it. Here is an example:

test=# CREATE TABLE a (aid int);
CREATE TABLE
test=# CREATE USER joe;
CREATE ROLE
test=# GRANT SELECT ON a TO joe;
GRANT

As you can see “joe” has a single permission and there is already no way to kill the user without revoking the permission first:

test=# DROP USER joe;
ERROR: role "joe" cannot be dropped because some objects depend on it
DETAIL: privileges for table a

Note that there is no such thing as “DROP USER … CASCADE”. The reason is that users are created at the instance level. A user can therefore have rights in potentially dozens of PostgreSQL databases, and if you drop a user you cannot just blindly remove objects from other databases. It is therefore necessary to revoke all permissions first before a user can be removed. That can be a real issue if your deployments grow in size.
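In practice, the usual cleanup sequence is to run REASSIGN OWNED and DROP OWNED in every database where the user owns objects or holds privileges, and only then drop the role. A minimal sketch, assuming we hand joe’s objects over to the postgres role:

-- repeat in every database where joe owns objects or holds privileges
REASSIGN OWNED BY joe TO postgres;   -- transfer ownership of joe's objects
DROP OWNED BY joe;                   -- revoke joe's remaining privileges (and drop anything still owned) in this database
-- once every database is clean, the instance-wide role can finally go
DROP USER joe;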

Using roles to abstract tasks

One thing we have seen over the years is that tasks tend to outlive staff. Even after hiring and firing cleaning staff for your office five times, the task is still the same: somebody is going to clean your office twice a week. It can therefore make sense to abstract the tasks performed by “cleaning_staff” into a role, which is then assigned to individual people.

How can one implement this kind of abstraction?

test=# CREATE ROLE cleaning_staff NOLOGIN;
CREATE ROLE
test=# GRANT SELECT ON a TO cleaning_staff;
GRANT
test=# GRANT cleaning_staff TO joe;
GRANT ROLE

First we create a role called “cleaning_staff” and assign whatever permissions are needed to that role. In the next step the role is granted to “joe” to make sure that joe has all the permissions a typical cleaning person usually has. If only roles are granted to real people such as joe, it is a lot easier to remove those people from the system again.
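The payoff shows up when somebody leaves. If joe only held role memberships (and no direct grants such as the SELECT on “a” from the earlier example, which would still have to be revoked first), removing him boils down to two statements; a minimal sketch:

REVOKE cleaning_staff FROM joe;   -- take the task-related permissions away
DROP USER joe;                    -- works, because nothing else depends on the role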

Inspecting permissions

If you want to take a look at how permissions are set on your system, consider checking out pg_permission, which is available for free on our GitHub page: https://github.com/cybertec-postgresql/pg_permission

Just do a …

SELECT * FROM all_permissions;

… and filter for the desired role. You can then see at a glance which permissions are set at the moment. You can also run UPDATE on this view and PostgreSQL will automatically generate the necessary GRANT / REVOKE commands to adjust the underlying ACLs.
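For example, to look only at what “joe” is allowed to do, something along these lines should work (the role_name column is an assumption taken from the extension’s documentation; check the view definition in your installation):

-- list joe's current permissions; role_name is assumed from the pg_permission docs
SELECT * FROM all_permissions WHERE role_name = 'joe';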


Sebastian Insausti: How to Deploy PostgreSQL to a Docker Container Using ClusterControl


Docker has become the most common tool to create, deploy, and run applications by using containers. It allows us to package up an application with all of the parts it needs, such as libraries and other dependencies, and ship it all out as one package. Docker is often compared to a virtual machine, but instead of creating a whole virtual operating system, Docker lets applications use the same Linux kernel as the system they're running on, and only requires applications to be shipped with things not already present on the host computer. This gives a significant performance boost and reduces the size of the application.

In this blog, we’ll see how we can easily deploy PostgreSQL with Docker, and how to turn our setup into a primary/standby replication setup by using ClusterControl.

How to Deploy PostgreSQL with Docker

First, let’s see how to deploy PostgreSQL with Docker manually by using a PostgreSQL Docker Image.

The image is available on Docker Hub and you can find it from the command line:

$ docker search postgres
NAME                                         DESCRIPTION                                     STARS               OFFICIAL            AUTOMATED
postgres                                     The PostgreSQL object-relational database sy…   6519                [OK]

We’ll take the first result, the official one. So, we need to pull the image:

$ docker pull postgres

And run the node containers mapping a local port to the database port into the container:

$ docker run -d --name node1 -p 6551:5432 postgres
$ docker run -d --name node2 -p 6552:5432 postgres
$ docker run -d --name node3 -p 6553:5432 postgres

After running these commands, you should have this Docker environment created:

$ docker ps
CONTAINER ID        IMAGE                         COMMAND                  CREATED             STATUS                 PORTS                                                                                     NAMES
51038dbe21f8        postgres                      "docker-entrypoint.s…"   About an hour ago   Up About an hour       0.0.0.0:6553->5432/tcp                                                                    node3
b7a4211744e3        postgres                      "docker-entrypoint.s…"   About an hour ago   Up About an hour       0.0.0.0:6552->5432/tcp                                                                    node2
229c6bd23ff4        postgres                      "docker-entrypoint.s…"   About an hour ago   Up About an hour       0.0.0.0:6551->5432/tcp                                                                    node1

Now, you can access each node with the following commands:

$ docker exec -ti [db-container] bash
$ su postgres
$ psql
psql (11.2 (Debian 11.2-1.pgdg90+1))
Type "help" for help.
postgres=#

Then, you can create a database user, change the configuration according to your requirements, or configure replication between the nodes manually.
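For instance, inside one of the nodes you might create an application user and a role for streaming replication (the names and passwords below are just placeholders):

-- connect to a node first, e.g.: docker exec -ti node1 psql -U postgres
CREATE USER app_user WITH PASSWORD 'change_me';                       -- application login
CREATE DATABASE appdb OWNER app_user;                                 -- its own database
CREATE ROLE repl_user WITH REPLICATION LOGIN PASSWORD 'change_me';    -- for manual streaming replication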

How to Import Your PostgreSQL Containers into ClusterControl

Now that you've set up your PostgreSQL cluster, you still need to monitor it, alert in case of performance issues, manage backups, detect failures, and automatically failover to a healthy server.

If you already have a PostgreSQL cluster running on Docker and you want ClusterControl to manage it, you can simply run the ClusterControl container in the same Docker network as the database containers. The only requirement is to ensure the target containers have SSH related packages installed (openssh-server, openssh-clients). Then allow passwordless SSH from ClusterControl to the database containers. Once done, use the “Import Existing Server/Cluster” feature and the cluster should be imported into ClusterControl.

First, let’s install the OpenSSH-related packages on the database containers, allow root login, start the SSH service, and set the root password:

$ docker exec -ti [db-container] apt-get update
$ docker exec -ti [db-container] apt-get install -y openssh-server openssh-client
$ docker exec -it [db-container] sed -i 's|^PermitRootLogin.*|PermitRootLogin yes|g' /etc/ssh/sshd_config
$ docker exec -it [db-container] sed -i 's|^#PermitRootLogin.*|PermitRootLogin yes|g' /etc/ssh/sshd_config
$ docker exec -ti [db-container] service ssh start
$ docker exec -it [db-container] passwd

Start the ClusterControl container (if it’s not started) and forward port 80 on the container to port 5000 on the host:

$ docker run -d --name clustercontrol -p 5000:80 severalnines/clustercontrol

Verify the ClusterControl container is up:

$ docker ps | grep clustercontrol
7eadb6bb72fb        severalnines/clustercontrol   "/entrypoint.sh"         4 hours ago         Up 4 hours (healthy)   22/tcp, 443/tcp, 3306/tcp, 9500-9501/tcp, 9510-9511/tcp, 9999/tcp, 0.0.0.0:5000->80/tcp   clustercontrol

Open a web browser, go to http://[Docker_Host]:5000/clustercontrol and create a default admin user and password. You should now see the ClusterControl main page.

The last step is setting up passwordless SSH to all database containers. For this, we need to know the IP address of each database node. To find it, we can run the following command for each node:

$ docker inspect [db-container] |grep IPAddress
            "IPAddress": "172.17.0.6",

Then, attach to the ClusterControl container interactive console:

$ docker exec -it clustercontrol bash

Copy the SSH key to all database containers:

$ ssh-copy-id 172.17.0.6
$ ssh-copy-id 172.17.0.7
$ ssh-copy-id 172.17.0.8

Now, we can start to import the cluster into ClusterControl. Open a web browser and go to Docker’s physical host IP address with the mapped port, e.g. http://192.168.100.150:5000/clustercontrol, click on “Import Existing Server/Cluster”, and then add the following information.

We must specify User, Key or Password and port to connect by SSH to our servers. We also need a name for our new cluster.

After setting up the SSH access information, we must define the database user, version, basedir and the IP Address or Hostname for each database node.

Make sure you get the green tick when entering the hostname or IP address, indicating ClusterControl is able to communicate with the node. Then, click the Import button and wait until ClusterControl finishes its job. You can monitor the process in the ClusterControl Activity Section.

The database cluster will be listed under the ClusterControl dashboard once imported.

Note that if you only have a PostgreSQL master node, you can still add it into ClusterControl. You can then add the standby nodes from the ClusterControl UI and let ClusterControl configure them for you.


How to Deploy Your PostgreSQL Containers with ClusterControl

Now, let’s see how to deploy PostgreSQL with Docker by using a CentOS Docker Image (severalnines/centos-ssh) and a ClusterControl Docker Image (severalnines/clustercontrol).

First, we’ll deploy a ClusterControl Docker Container using the latest version, so we need to pull the severalnines/clustercontrol Docker Image.

$ docker pull severalnines/clustercontrol

Then, we’ll run the ClusterControl container and publish port 5000 to access it.

$ docker run -d --name clustercontrol -p 5000:80 severalnines/clustercontrol

Now you can open the ClusterControl UI at http://[Docker_Host]:5000/clustercontrol and create a default admin user and password.

The severalnines/centos-ssh image comes with the SSH service enabled and also with an auto-deployment feature, but the latter is only valid for Galera Cluster; PostgreSQL is not supported yet. So, we’ll set the AUTO_DEPLOYMENT variable to 0 in the docker run commands that create the database nodes.

$ docker run -d --name postgres1 -p 5551:5432 --link clustercontrol:clustercontrol -e AUTO_DEPLOYMENT=0 severalnines/centos-ssh
$ docker run -d --name postgres2 -p 5552:5432 --link clustercontrol:clustercontrol -e AUTO_DEPLOYMENT=0 severalnines/centos-ssh
$ docker run -d --name postgres3 -p 5553:5432 --link clustercontrol:clustercontrol -e AUTO_DEPLOYMENT=0 severalnines/centos-ssh

After running these commands, we should have the following Docker environment:

$ docker ps
CONTAINER ID        IMAGE                         COMMAND             CREATED             STATUS                    PORTS                                                                                     NAMES
0df916b918a9        severalnines/centos-ssh       "/entrypoint.sh"    4 seconds ago       Up 3 seconds              22/tcp, 3306/tcp, 9999/tcp, 27107/tcp, 0.0.0.0:5553->5432/tcp                             postgres3
4c1829371b5e        severalnines/centos-ssh       "/entrypoint.sh"    11 seconds ago      Up 10 seconds             22/tcp, 3306/tcp, 9999/tcp, 27107/tcp, 0.0.0.0:5552->5432/tcp                             postgres2
79d4263dd7a1        severalnines/centos-ssh       "/entrypoint.sh"    32 seconds ago      Up 31 seconds             22/tcp, 3306/tcp, 9999/tcp, 27107/tcp, 0.0.0.0:5551->5432/tcp                             postgres1
7eadb6bb72fb        severalnines/clustercontrol   "/entrypoint.sh"    38 minutes ago      Up 38 minutes (healthy)   22/tcp, 443/tcp, 3306/tcp, 9500-9501/tcp, 9510-9511/tcp, 9999/tcp, 0.0.0.0:5000->80/tcp   clustercontrol

We need to know the IP address of each database node. To find it, we can run the following command for each node:

$ docker inspect [db-container] |grep IPAddress
            "IPAddress": "172.17.0.3",

Now that we have the server nodes up and running, we need to deploy our database cluster. To make this easy, we’ll use ClusterControl.

To perform a deployment from ClusterControl, open the ClusterControl UI at http://[Docker_Host]:5000/clustercontrol, then select the option “Deploy” and follow the instructions that appear.

When selecting PostgreSQL, we must specify User, Key or Password and port to connect by SSH to our servers. We also need a name for our new cluster, and to decide whether we want ClusterControl to install the corresponding software and configurations for us.

After setting up the SSH access information, we must define the database user, version and datadir (optional). We can also specify which repository to use.

In the next step, we need to add our servers to the cluster that we are going to create.

When adding our servers, we can enter IP or hostname. Here we must use the IP Address that we got from each container previously.

In the last step, we can choose if our replication will be Synchronous or Asynchronous.

We can monitor the status of the creation of our new cluster from the ClusterControl activity monitor.

Once the task is finished, we can see our cluster in the main ClusterControl screen.

Conclusion

As we have seen, deploying PostgreSQL with Docker is easy at the beginning, but it requires a bit more work to configure replication, and you still need to monitor your cluster to see what is happening. With ClusterControl, you can import or deploy your PostgreSQL cluster with Docker, as well as automate monitoring and management tasks like backups and automatic failover/recovery. Try it out.
