
Kaarel Moppel: Are triggers really that slow in Postgres?


First of course the big question – should we be using good old triggers at all? Well, actually I’m not going to recommend anything here as it’s an opinionated topic:) People well-versed in databases would probably see good use cases for them whereas modern application developers would mostly say it’s an awful practice – doing some “magic stuff”™ kind of secretly. Avoiding that holy war, let’s say here that we already need to use triggers – boss’s orders. But then comes the question – should we be afraid of using triggers in Postgres due to possible performance penalties? Should we plan to beef up the hardware or do some app optimizations beforehand? From my personal experience – no, mostly nothing will change in the big picture if “used moderately”. But let’s try to generate some numbers as that’s where the truth lives…

A pgbench test scenario with triggers

All applications are generally unique, so the most critical part of a usable performance test boils down to setting up a more or less plausible use case. As Postgres comes bundled with the quick benchmarking tool pgbench, I usually tend to take its schema as a baseline and make some modifications based on the type of application the customer has. For triggers the most common use case is probably “auditing” – making sure at the database level that we store some data on the author/reason for changes to all rows. So to simulate such basic auditing I decided to just add two audit columns to all pgbench tables receiving updates (3 of them) in the default transaction mode. Let’s create last_modified_on (as timestamp) and last_modified_by (as text) and then fill them with triggers – basically something like this:

...
ALTER TABLE pgbench_(accounts|branches|tellers)
	ADD COLUMN last_modified_on timestamptz,
	ADD COLUMN last_modified_by text;
...
CREATE FUNCTION trg_last_modified_audit() RETURNS TRIGGER AS
$$
BEGIN
	IF NEW.last_modified_by IS NULL THEN
		NEW.last_modified_by = session_user;
	END IF;
	IF NEW.last_modified_on IS NULL THEN
		NEW.last_modified_on = current_timestamp;
	END IF;
	RETURN NEW;
END;
$$ LANGUAGE plpgsql;
...
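The “...” above presumably hides the CREATE TRIGGER statements attaching the function to the three tables. A minimal sketch of what that attachment could look like for pgbench_accounts (the trigger name is mine, the other two tables would be analogous):

-- attach the audit function as a row trigger (hypothetical trigger name)
CREATE TRIGGER accounts_last_modified_trg
	BEFORE INSERT OR UPDATE ON pgbench_accounts
	FOR EACH ROW
	EXECUTE PROCEDURE trg_last_modified_audit();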

Hardware / Software

Next I booted up a moderately specced (2 CPU, 4 GB RAM, 48 GB SSD) test machine on Linode (Ubuntu 18.04) and installed the latest Postgres (v10.4) binaries from the official Postgres project managed repository, leaving all postgresql.conf settings at default except shared_buffers (which I set at 3GB). For the pgbench scaling factor I chose 100, giving us a ~1.3GB database (see here for more on how to choose those scale numbers) so that everything is basically cached and we can factor out most IO jitter – checkpoints, the background writer and autovacuum are still kind of random of course, but they’re typically there for real-life systems too, so I’m not sure removing them would be a good idea.

NB! For carrying out the actual testing I then compiled the latest (11devel) pgbench to make use of repeatable test cases (the new --random-seed parameter!), initialized the schema and ran the simplest possible pgbench test for the triggered/untriggered cases for 2h with 3 loops. Basically something like seen below (for the full script see here).

pgbench -i -s 100
pgbench -T 7200 --random-seed=2018

Summary on simple row modification trigger performance

So the first test compared pgbench transactions “as is” (with just the 2 auditing columns added to the 3 tables getting updates) vs the “triggered” case where each of those tables also had a trigger installed setting the auditing timestamp/username. The average transaction latency results came back pretty much as expected: 1.173 ms vs 1.178 ms, i.e. a 0.4% difference, meaning basically no difference in average transaction latency at all for transactions where 3 simple data checking/filling triggers are executed in the background!

To reiterate: if having a typical OLTP transaction touching a couple of tables, PL/PgSQL triggers containing just simple business logic can be used without further performance considerations!

Hmm…but how many simple triggers would you then need to see some noticeable runtime difference in our use case? Probably at least a dozen! In a typical short-lived OLTP transaction context we’re still mostly IO bound (largely disk fsync speed) and, especially for multi-statement transactions, also network (round-trip) bound…so worrying about a few extra CPU cycles spent in triggers can be spared.

So what could be tried more to get an idea of penalties resulting from triggers? First we could make the transactions thinner by getting rid of network round trip latency – a single UPDATE on the pgbench_accounts would be good for that in our case. Then we could also let triggers insert some data into some other tables…but that’s enough content for another blog post I believe. See you soonish!

The post Are triggers really that slow in Postgres? appeared first on Cybertec.


Craig Kerstiens: Preparing your multi-tenant app for scale


We spend a lot of time with companies that are growing fast, or planning for future growth. It may be you’ve built your product and are now just trying to keep the system growing and scaling to handle new users and revenue. Or you may still be building the product, but know that even a moderate level of success could lead to a lot of scaling. In either case, choosing where you spend your time is key so you don’t lose valuable time.

As Donald Knuth states in Computer Programming as an Art:

“Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”

With the above in mind one of the most common questions we get is: What do I need to do now to make sure I can scale my multi-tenant application later?

We’ve written some before about approaches not to take such as schema based sharding or one database per customer and the trade-offs that come with that approach. Here we’ll dig into three key steps you should take that won’t be wasted effort should the need to scale occur.

Denormalize earlier to scale later

In the world of databases you can learn all about normalization. Normalization has various forms, but in simple terms it means not duplicating data: each unique piece of data has a key to identify it, and other tables reference that key rather than the data itself. In the reporting world denormalization can be extremely common, allowing you to generate reports faster. For multi-tenant applications going somewhere down the middle of the road is ideal, and the best step you can take early is ensuring you have your tenant_id on every single table. Suppose you have a basic CRM schema:

CREATE TABLE leads (
    id serial primary key,
    first_name text,
    last_name text,
    email text
);

CREATE TABLE accounts (
    id serial primary key,
    name text,
    state varchar(2),
    size int
);

CREATE TABLE opportunity (
    id serial primary key,
    name text,
    amount int
);

To plan to scale later a key step would be to add the tenant_id onto every table, in this case it’ll be org_id:

CREATE TABLE leads (
    id serial primary key,
    first_name text,
    last_name text,
    email text,
    org_id int
);

CREATE TABLE accounts (
    id serial primary key,
    name text,
    state varchar(2),
    size int,
    org_id int
);

CREATE TABLE opportunity (
    id serial primary key,
    name text,
    amount int,
    org_id int
);

Adapt your keys to leverage the multi-tenant schema

In addition to adding the tenant_id, it’s a good practice to then use it as part of your primary and foreign keys. The good thing for you is that composite keys work great in Postgres. Instead of putting your primary key on a single column, you specify it at the end of the table creation:

CREATE TABLE leads (
    id serial,
    first_name text,
    last_name text,
    email text,
    org_id int,
    primary key (org_id, id)
);
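The same idea carries over to foreign keys. A minimal sketch, assuming accounts got the same composite primary key treatment (the account_id column is invented here purely for illustration and isn’t part of the schema above):

CREATE TABLE opportunity (
    id serial,
    name text,
    amount int,
    account_id int,
    org_id int,
    primary key (org_id, id),
    -- the tenant column rides along in the foreign key as well
    foreign key (org_id, account_id) references accounts (org_id, id)
);

This keeps every join and constraint scoped to the tenant, which is exactly the shape a shard-by-tenant setup wants later on.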

By denormalizing a little and updating your keys to be composite ones you’re in pretty good shape to scale out when the time comes. But there is one more bit of work you can put in early that won’t be wasted.

Manage your Postgres connections before they manage you

A single connection to a Postgres database consumes roughly 10 MB of overhead. Most web application frameworks such as Rails and Django keep a pool of connections or persist them so when they have a new request there is less time spent connecting to the database. This reduction in time comes at the cost of using more resources from your database. When you’re early this may be an okay trade-off, but if you ever need to scale this will become a bottleneck.

The ideal is to set up proper connection management via something like a third-party connection pooler such as pgbouncer. Even if you’re not ready to put in pgbouncer today, you should be monitoring both your active and idle connections to make sure this doesn’t sneak up on you.
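A quick way to keep an eye on this is a periodic look at pg_stat_activity; a minimal sketch that counts connections per state for the current database:

-- How many connections are active vs idle vs idle in transaction?
SELECT state, count(*)
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state
ORDER BY count(*) DESC;

If the idle count keeps climbing relative to active, that’s a good hint it’s time to put pgbouncer in front of the database.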

Good investments mean good returns

The best part of each of the above steps is that by doing them early, the effort is minimal. If you refactor later, the time it takes becomes proportional to how large and complex your app has grown. The best technical investments are ones that can be made gradually with minimal effort, but still provide compounding gains over the long term. If you have questions on planning for scale or are already running into scaling issues with your Postgres database, feel free to reach out to our database team here at Citus to see if we can help.

Luca Ferrari: pg_chocolate (aka the end of course in Modena)


Yesterday was my last session at the local Linux Users’ Group ConoscereLinux course on PostgreSQL. Since the LUG gave me the chance to deliver an extra session, and since all the attendees were fun and nice, I decided to “contribute back”.

pg_chocolate

No, this is not a new project about our favourite database.
I asked my great wife to make a chocolate cake to share with all attendees, and we came up with the idea of putting the elephant logo on top of it.


Let’s say it was not difficult for my wife to produce the elephant cake, which disappeared bite after bite…you know, even complex queries become simpler in front of a good cake! Chances are there will be a replication instance coming up sooner or later!



Pavel Stehule: How to don't use PL/pgSQL and other fatal errors when PL/pgSQL is used

Joshua Otwell: Top PostgreSQL Security Threats


Modern databases store all kinds of data. From trivial to highly sensitive. The restaurants we frequent, our map locations, our identity credentials, (e.g., Social Security Numbers, Addresses, Medical Records, Banking info, etc...), and everything in between is more than likely stored in a database somewhere. No wonder data is so valuable.

Database technologies advance at a breakneck pace. Innovation, progression, integrity, and enhancements are at the forefront as a direct result of the labors of intelligent and devoted engineers, developers, and the robust communities supporting those vendors.

Yet there is another side to the coin, one that unfortunately co-exists within this data-driven world in the form of malware, viruses, and exploits on a massive, all-time-high scale.

Data is valuable to the parties on that side of the operation as well. But for different reasons. Any of them could be but are not limited to power, blackmail, financial gain and access, control, fun, pranks, malice, theft, revenge... You get the idea. The list is endless.

Alas, we have to operate with a security mindset. Without this mindset, we leave our systems vulnerable to these types of attacks. PostgreSQL is just as susceptible to compromise, misuse, theft, unauthorized access/control as other forms of software.

So What Measures Can We Take to Mitigate the Number of Risks to Our PostgreSQL Installs?

I strongly feel that promoting awareness of the known threats out there is as good a place to start as any. Knowledge is power and we should use everything available at our disposal. Besides, how can we police what we are not even aware of in order to tighten up security on those PostgreSQL instances and protect the data residing there?

I recently searched out known security 'concerns' and 'threats', targeting the PostgreSQL environment. My search encompassed recent reports, articles, and blog posts within the first quarter of 2018. In addition to that specific time frame, I explored well-known long-standing concerns that are still viable threats today (namely SQL Injection), while not polished or brandished as 'recently discovered'.

A Photo Opportunity

A Deep Dive into Database Attacks [Part III]: Why Scarlett Johansson's Picture Got My Postgres Database to Start Mining Monero

Word of this crafty malware attack returned the most 'hits' out of my objective search results.

We'll visit one of several great blog posts and give a rundown of its content. I've also included additional blog posts towards the end of this section, so be sure to visit those as well for more details on this intrusion.

Observations

Information from Imperva reports that their honeypot database (StickyDB) discovered a malware attack on one of their PostgreSQL servers. The honeypot net, as Imperva names the system, is designed to trick attackers into attacking the database so they (Imperva) can learn about the attack and become more secure. In this particular instance, the payload is malware that cryptomines Monero, embedded in a photo of Scarlett Johansson.

The payload is dumped to disk at runtime with the lo_export function. But apparently this works because lo_export is invoked indirectly via its entry in pg_proc rather than by a normal direct call.

Here are some interesting details directly from the blog post, for clarity (see cited article):

Now the attacker is able to execute local system commands using one simple function – fun6440002537. This SQL function is a wrapper for calling a C-language function, “sys_eval”, a small exported function in “tmp406001440” (a binary based on sqlmapproject), which basically acts as proxy to invoke shell commands from SQL client.

So what will be next steps of the attack? Some reconnaissance. So it started with getting the details of the GPU by executing lshw -c video and continued to cat /proc/cpuinfo in order to get the CPU details (Figures 3-4). While this feels odd at first, it makes complete sense when your end goal is to mine more of your favorite cryptocurrency, right?

With a combination of database access and the ability to execute code remotely, all while 'flying under the radar' of monitoring solutions, the trespasser then downloads the payload via a photo of Scarlett Johansson.

(Note: The photo has since been removed from its hosted location. See linking article for the mention.)

According to the report, the payload is in binary format. That binary code was appended to the photo so that it would pass for an actual photo during upload and remain viewable.

See Figure 6 of the post for the SQL responsible for utilizing wget, dd, and executing chmod for permissions on the downloaded file. That downloaded file then creates another executable which is responsible for actually mining the Monero. Of course, housekeeping and cleanup are needed after all this nefarious work.

Figure 7 depicts the SQL that performs this cleanup.

Imperva recommends monitoring this list of potential breach areas in the closing section:

  • Watch out for direct PostgreSQL calls to lo_export or indirect calls through entries in pg_proc.
  • Beware of PostgreSQL functions calling C-language binaries (a catalog query sketch follows this list).
  • Use a firewall to block outgoing network traffic from your database to the Internet.
  • Make sure your database is not assigned a public IP address. If it is, restrict access to only the hosts that interact with it (application servers or clients owned by DBAs).
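As a starting point for the first two items, a simple catalog query can list user-defined C-language functions and where their binaries live. This is only a sketch; a hit merely deserves a closer look (legitimate extensions install C functions too) and is not proof of compromise:

-- List C-language functions outside the system schemas
SELECT p.proname, p.probin, n.nspname
FROM pg_proc p
JOIN pg_language l ON l.oid = p.prolang
JOIN pg_namespace n ON n.oid = p.pronamespace
WHERE l.lanname = 'c'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema');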

Imperva also performed various antivirus tests along with details of how attackers can potentially locate vulnerable PostgreSQL servers. For brevity I have not included them here; consult the article for full details of their findings.

Recommended Reading

CVE Details, Report, and Vulnerabilities

I visited this site, which posts the latest security threats on a per-vendor basis, and discovered 4 vulnerabilities in Q1 of 2018. The PostgreSQL Security Information page also has them listed, so feel free to consult that resource.

Although almost all of them have been addressed, I felt it important to include them in this post to bring awareness to readers who may not have known about them. I feel we can learn from all of them, especially from the different ways the vulnerabilities were discovered.

They are listed below in the order of date published:

I. CVE-2018-1052 date published 2018-02-09 : Update Date 3/10/2018

Overview:

Memory disclosure vulnerability in table partitioning was found in PostgreSQL 10.x before 10.2, allowing an authenticated attacker to read arbitrary bytes of server memory via purpose-crafted insert to a partitioned table.

This vulnerability was fixed with the release of PostgreSQL 10.2, as confirmed here. The older 9.x versions that were also fixed are mentioned as well, so visit that link to check your specific version.

II. CVE-2018-1053 date published 2018-02-09 : Update Date 3/15/2018

Overview:

In PostgreSQL 9.3.x before 9.3.21, 9.4.x before 9.4.16, 9.5.x before 9.5.11, 9.6.x before 9.6.7 and 10.x before 10.2, pg_upgrade creates file in current working directory containing the output of `pg_dumpall -g` under umask which was in effect when the user invoked pg_upgrade, and not under 0077 which is normally used for other temporary files. This can allow an authenticated attacker to read or modify the one file, which may contain encrypted or unencrypted database passwords. The attack is infeasible if a directory mode blocks the attacker searching the current working directory or if the prevailing umask blocks the attacker opening the file.

As with the previous CVE-2018-1052, PostgreSQL 10.2 fixed this portion of the vulnerability:

Ensure that all temporary files made with "pg_upgrade" are non-world-readable

Many older versions of PostgreSQL are affected by this vulnerability. Be sure to visit the provided link for all the listed versions.

III. CVE-2017-14798 date published 2018-03-01 : Update Date 3/26/2018

Overview:

A race condition in the PostgreSQL init script could be used by attackers able to access the PostgreSQL account to escalate their privileges to root.

Although I could not find PostgreSQL version 10 mentioned anywhere on the linked page, many older versions are, so visit that link if you are running them.

SUSE Linux Enterprise Server users may be interested in two linked articles, here and here, where this vulnerability was fixed for the version 9.4 init script.

IV. CVE-2018-1058 date published 2018-03-02 : Update Date 3/22/2018

Overview:

A flaw was found in the way PostgreSQL allowed a user to modify the behavior of a query for other users. An attacker with a user account could use this flaw to execute code with the permissions of superuser in the database. Versions 9.3 through 10 are affected.

This update release mentions this vulnerability with an interesting linked document all users should visit.

The article provides a fantastic guide from the community titled A Guide to CVE-2018-1058: Protect Your Search Path that has an incredible amount of information concerning the vulnerability, risks, and best practices for combating it.

I'll do my best to summarize, but visit the guide for your own benefit, comprehension, and understanding.

Overview:

With the advent of PostgreSQL version 7.3, schemas were introduced into the ecosystem. This enhancement allows users to create objects in separate namespaces. By default, when a user creates a database, PostgreSQL also creates a public schema in which all new objects are created. Users who can connect to a database can also create objects in that database's public schema.

This section directly from the guide is highly important (see cited article):

Schemas allow users to namespace objects, so objects of the same name can exist in different schemas in the same database. If there are objects with the same name in different schemas and the specific schema/object pair is not specified (i.e. schema.object), PostgreSQL decides which object to use based on the search_path setting. The search_path setting specifies the order the schemas are searched when looking for an object. The default value for search_path is $user,public where $user refers to the name of the user connected (which can be determined by executing SELECT SESSION_USER;).

Another key point is here:

The problem described in CVE-2018-1058 centers around the default "public" schema and how PostgreSQL uses the search_path setting. The ability to create objects with the same names in different schemas, combined with how PostgreSQL searches for objects within schemas, presents an opportunity for a user to modify the behavior of a query for other users.

Below is a high-level list of the practices the guide recommends applying to reduce the risk of this vulnerability (a minimal SQL sketch follows the list):

  • Do not allow users to create new objects in the public schema
  • Set the default search_path for database users
  • Set the default search_path in the PostgreSQL configuration file (postgresql.conf)
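The guide spells out concrete commands for the first two items; a minimal sketch (the role name app_user is hypothetical):

-- Stop ordinary users from creating objects in the public schema
REVOKE CREATE ON SCHEMA public FROM PUBLIC;

-- Pin the search_path for an application role so it no longer falls back to public
ALTER ROLE app_user SET search_path = "$user";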

SQL Injection

No 'security-themed' SQL blog post or article can label itself as such without mention of SQL injection. While this method of attack is by no stretch of the imagination 'the new kid on the block', it has to be included.

SQL Injection is always a threat and perhaps even more so in the web application space. Any SQL database -including PostgreSQL- is potentially vulnerable to it.

While I don't have a deep knowledge base on SQL injection - also known as SQLi - I'll do my best to provide a brief summary of how it can potentially affect your PostgreSQL server and, ultimately, how to reduce the risk of falling prey to it.

Refer to the links provided towards the end of this section, all of which contain a wealth of information, explanation, and examples in those areas I am unable to adequately communicate.

Unfortunately, several types of SQL injection exist, and they all share the common goal of inserting offensive SQL into queries for execution in the database, queries the developer never intended or designed.

Unsanitized user input, poorly designed or non-existent type checking (a.k.a. validation), and unescaped user input can all potentially leave the door wide open for would-be attackers. Many web programming APIs provide some protection against SQLi, e.g. ORMs (Object-Relational Mappers), parameterized queries, type checking, etc. However, it is the developer's responsibility to make every effort to reduce prime scenarios for SQL injection by implementing the protections and mechanisms at their disposal.

Here are notable suggestions from the OWASP SQL Injection Prevention Cheat Sheet for reducing the risk of SQL injection. Be sure to visit it for complete details and example uses in practice (see cited article); a small sketch of Option 1 follows the lists below.

Primary Defenses:

  • Option 1: Use of Prepared Statements (with Parameterized Queries)
  • Option 2: Use of Stored Procedures
  • Option 3: White List Input Validation
  • Option 4: Escaping All User Supplied Input

Additional Defenses:

  • Also: Enforcing Least Privilege
  • Also: Performing White List Input Validation as a Secondary Defense
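To make Option 1 concrete at the SQL level, here is a minimal sketch using a server-side prepared statement (the table and values are hypothetical); in application code you would normally rely on your driver's parameter-binding API rather than building SQL strings:

-- The statement is planned once with a placeholder...
PREPARE find_account(text) AS
    SELECT id, name FROM accounts WHERE name = $1;

-- ...and the user-supplied value travels as a bound parameter, never as SQL text
EXECUTE find_account('Acme Corp');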

Recommended Reading:

I’ve included additional articles with a load of information for further study and awareness:


Postgres Role Privileges

We have a saying along the lines of "We are our own worst enemy."

We can definitely apply it to working within the PostgreSQL environment. Neglect, misunderstanding, or lack of diligence are just as much an opportunity for attacks and unauthorized use as those purposely launched.

Perhaps even more so, inadvertently allowing easier access, routes, and channels for offending parties to tap into.

I’ll mention an area that always needs re-evaluation or reassessment from time to time.

Unwarranted or extraneous role privileges.

  • SUPERUSER
  • CREATEROLE
  • CREATEDB
  • GRANT

This amalgamation of privileges is definitely worth a look. SUPERUSER and CREATEROLE are extremely powerful attributes and would be better placed in the hands of a DBA than an analyst or developer, wouldn't you think?

Does the role really need the CREATEDB attribute? What about GRANT? That attribute has the potential for misuse in the wrong hands.

Weigh all options heavily before granting roles these attributes in your environment.
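A periodic sanity check against pg_roles makes this reassessment easy; a minimal sketch listing every role that holds one of these powerful attributes:

-- Roles holding powerful attributes; review each one regularly
SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
FROM pg_roles
WHERE rolsuper OR rolcreaterole OR rolcreatedb
ORDER BY rolname;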

Strategies, Best Practices and Hardening

Below is a list of useful blog posts, articles, checklists, and guides returned by a search covering the past year (at the time of this writing). They are not listed in any order of importance and each offers noteworthy suggestions.

Conclusion

My hope is with the information provided in this blog post, along with the robust community, we can stay at the forefront of threats against the PostgreSQL database system.

Dimitri Fontaine: PostgreSQL Data Types


Today it’s time to conclude our series of PostgreSQL Data Types articles with a recap. The series covers lots of core PostgreSQL data types and shows how to benefit from the PostgreSQL concept of a data type: more than input validation, a PostgreSQL data type also implements expected behaviors and processing functions.

This allows an application developer to rely on PostgreSQL for more complex queries, having the processing happen where the data is, for instance when implementing advanced JOIN operations, then retrieving only the data set that is interesting for the application.

Ozgun Erdogan: Citus 7.4: Move fast and reduce technical debt


Today, we’re excited to announce the latest release of our distributed database, Citus 7.4! Citus scales out PostgreSQL through sharding, replication, and query parallelization.

Ever since we open sourced Citus as a Postgres extension, we have been incorporating your feedback into our database. Over the past two years, our release cycles went down from six to four to two months. As a result, we have announced 10 new Citus releases, where each release came with notable new features.

Shorter release cycles and more features came at a cost, however. In particular, we added new distributed planner and executor logic to support different use cases for multi-tenant applications and real-time analytics, but we couldn’t find the time to refactor this new logic. We found ourselves accumulating technical debt. Further, our distributed SQL coverage expanded over the past two years, and with each year we ended up spending more and more time on testing each new release.

In Citus 7.4, we focused on reducing technical debt related to these items. At Citus, we track our development velocity with each release. While we fix bugs in every release, we found that a full release focused on addressing technical debt would help to maintain our release velocity. Also, a cleaner codebase leads to a happier and more productive engineering team.

Refactoring distributed query planners in Citus

When you shard a relational database, you’re taking diverse features built over decades and applying them onto a distributed system. The challenge here is that these diverse features were built with a single machine in mind: CPU, memory, and disk. When you take any of these features and execute them over a distributed system, you have a new bottleneck: the network.

So how do you handle these diverse features in a distributed environment? To accommodate these differences, Citus uses a “layered approach” to planning. When a query comes into the database, Citus picks the most efficient planner among four different planners. You can learn more about Citus planners from a talk Marco delivered recently at PGConf US.

Citus Planner                    | Use Case
Router planner                   | Multi-tenant (B2B) / OLTP
Pushdown planner                 | Real-time analytics / search
Recursive (subquery/CTE) planner | Real-time analytics / data warehouse
Logical planner                  | Data warehouse

In this diagram, the four planners look cleanly separated. In practice, we added new features that make use of these planners over time. As a result, we ended up having features that leaked between the four distributed planner layers. For example, if you wanted to insert a single row, it was easy to route that request to the proper shard. Similarly, if you wanted to execute an analytics query, Citus would easily pick the appropriate planner. Citus would then plan the query for distributed execution and push down the work to related shards.

What happens when you have a query that both updates rows and also analyzes them? For these types of queries, Citus planning logic would leak between the four planner types.

UPDATE accounts
SET dept = select_query.max_dept * 2
FROM (
    SELECT DISTINCT ON (tenant_id) tenant_id, max(dept) AS max_dept
    FROM (
        SELECT accounts.dept, accounts.tenant_id
        FROM accounts, company
        WHERE company.tenant_id = accounts.tenant_id
    ) select_query_inner
    GROUP BY tenant_id
    ORDER BY 1 DESC
) AS select_query
WHERE select_query.tenant_id != accounts.tenant_id
  AND accounts.dept IN (2)
RETURNING accounts.tenant_id, accounts.dept;

In Citus 7.4, we refactored our planning logic for DML statements (INSERT, UPDATE, DELETE) that include SELECT statements within them. This refactoring both increased the types of statements Citus can handle and reduced dependencies between the different planner components, making future changes to each planner easier.

Complex bugs due to incremental changes / code duplication

As Citus expanded into new workloads, we added new planner components to meet new requirements. At the same time, Citus also notably expanded its SQL coverage for analytical workloads.

Since Citus 5, we added new features mostly as incremental changes to existing planner code. Sometimes we made these changes on short notice because they ended up being critical for a customer.

As we went through our release testing, we found that these incremental features worked well in isolation. However, they could lead to bugs when used in combination.

For example, the following query didn’t pass our acceptance tests. Part of the issue with this query was how Citus handled the combination of DISTINCT ON clauses, aggregate and window functions, window functions in the ORDER BY clause, HAVING clause, and a LIMIT clause.

SELECT DISTINCT ON (AVG(value_1) OVER (PARTITION BY user_id))
    user_id,
    AVG(value_2) OVER (PARTITION BY user_id)
FROM users_table
GROUP BY user_id, value_1, value_2
HAVING count(*) = 1
ORDER BY (AVG(value_1) OVER (PARTITION BY user_id)), 2 DESC, 1
LIMIT 100;

Another part of the issue was that Citus handled the SQL features in this query across two planner components. Depending on the sharding key, Citus would decide which parts of the query it could push down to the machines in the cluster. This led to code duplication and complex interactions between the planner components.

In Citus 7.4, we refactored the distributed query planner to remove these issues. As part of this change, we moved all the logic related to executing the above analytical clauses into one place. In the process, we removed code that had been duplicated across different components. Last, we made the dependencies between the above clauses explicit. This way, as we add support for new clauses that arrive with new Postgres releases, we’ll have a faster and easier time integrating them.

Features that became obsolete with new requirements

Citus’ early versions primarily targeted analytical workloads. In these workloads, users would shard their large tables. For the small tables, Citus would take a more dynamic approach. When the user issued an analytical query, Citus would then broadcast this small table’s data to all the machines in the cluster. We referred to this operation as broadcast joins.

Broadcast joins worked well for customers who loaded their data in batches and at regular intervals. However, as we had more customers who needed to update these tables in real-time, the notion of a broadcast join became problematic. We first tried to mitigate this problem by introducing smarter caching of shards for broadcast joins.

During our Citus 6 release, we realized that broadcast joins were no longer tenable. So, we deprecated broadcast joins in favor of reference tables. With reference tables, the user would explicitly tell Citus that these tables were small. Citus would then replicate this table’s shards across all machines in the cluster and propagate updates to them immediately.


The category table is a reference table. The category table has a single shard that’s replicated across all machines in the cluster.

Because broadcast joins were deeply integrated into our planning logic, we never got around to cleaning them up. Over time, broadcast joins confused new team members. It also made changes to our planners harder. New features would break backwards compatibility with broadcast joins, so we ended up waiting on them.

Riot Games’ blog post on Taxonomy of Technical Debt identifies this type of technical debt as contagious (contagion). In Citus 7.4, we removed all legacy code related to broadcast joins. This reduced the amount of planner code we need to think about when implementing new features. It also resolved seven legacy issues related to broadcast joins.

Reducing time spent on release testing

We expect that each pull request in the Citus repository also comes with its related tests. Then, we assign a reviewer to code review the pull request. The author and the reviewer are also responsible for personal desk-checking of code.

Before each new Citus release, we then go through a release testing process. This process enables us to test new changes not only in isolation, but also in combination. We also compiled a checklist of functionality and performance tests over the years. This checklist helps us maintain the same level of product quality over releases.

For example, the following table captures the sets of tests we ran before releasing Citus 7.4.

Item                                                          | Estimated (dev days) | Actual (dev days)
Run steps in documentation; test app integration              | 1    | 2
Integration tests                                             | 1    | 1.5
Test new features                                             | 4    | 4
Failure testing                                               | 4    | 2.5
Concurrency testing                                           | 1.5  | 1.5
Performance testing (analytical and transactional workloads)  | .5   | 1
Scale testing                                                 | .5   | .5
Query cancellation testing                                    | 1    | 1
Valgrind (memory) testing                                     | .25  | 1
Upgrade to new version - test                                 | 1.5  | 1.75
Enterprise feature testing                                    | 3    | 2.5
Total                                                         | 18.25| 19.25

Over time, we started running into two challenges with release testing. First, as Citus expanded into new use-cases, the surface area of functionality that we needed to test grew with it. As a result, we ended up spending more and more time with release testing.

Second, we made improvements to our release process to reduce our release times from six to four to two months. Consequently, release testing ended up taking a longer portion of each release cycle.

To mitigate these challenges, we first reviewed our release testing process. We then made three improvements over the past year. First, we found test processes that we manually repeated in each release and automated them - such as integration and performance testing.

Second, we found that our regression test framework caught the most bugs and invested in making it better. Third, we realized that we spent the majority of our time on concurrency testing. Fortunately, Postgres also comes with an isolation test framework. We adopted this framework into Citus and started automating our distributed concurrency tests.

These improvements left one group of tests as a serious offender: failure testing. One of the challenges with distributed databases is that each component can fail. We realized that we were spending a notable amount of time in each release generating machine and network failures. Further, most bugs that surfaced in production were also in this bucket. Often, our customers would report a bug that showed up as a result of a combination of failures in the distributed system.

With Citus 7.4, we’re introducing an automated failure testing framework for Citus. We expect this framework to reduce the amount of time we spend in release testing. Further, we expect to codify typical machine and network failure scenarios into this framework. By doing so, we can catch bugs earlier and provide a better experience to our customers.

Reduce technical debt to move fast

Technical debt is a widely discussed topic between product and engineering teams. On one hand, you want to deliver a product with more features to your customers, fast. On the other hand, these features add complexity to your codebase. You need to keep this complexity in check so that you can bring new developers up to speed and continue to add new features at a fast pace.

At Citus, we see technical debt as maintaining a balance between product and engineering. In each release, we reserve a portion of our time for fixing issues related to technical debt. At the same time, we released ten new versions after open sourcing Citus two years ago. This led to more features and more code. So, keeping this complexity in check became important for maintaining our release velocity in the long term.

In Citus 7.4, we focused on reducing technical debt related to our planner logic, code paths that became obsolete over the years, and release test times. For a full list of changes, please see the list in our Github repo.

Share your feedback on Citus 7.4 with us

Citus 7.4 scales out Postgres through sharding, replication, and query parallelization. This latest release of our Citus database also comes with improvements that will make it easier for us to add new features in the future. As always, Citus is available as open source, as enterprise software you can run anywhere, and as a fully-managed database as a service.

If you give Citus a try, we’d love to hear your feedback. Please join the conversation in our Slack channel and let us know what you think.

Hans-Juergen Schoenig: CREATE VIEW vs. ALTER TABLE in PostgreSQL


In PostgreSQL, a view is a virtual table based on an SQL statement. It is an abstraction layer which allows you to access the result of a more complex SQL statement quickly and easily. The fields in a view are fields from one or more real tables in the database. The question many people now ask is: if a view is based on a table, what happens when the data structure of the underlying table changes?

CREATE VIEW in PostgreSQL

To show what PostgreSQL will do, I created a simple table:

view_demo=# CREATE TABLE t_product
(
        id         serial,
        name       text,
        price      numeric(16, 4)
);
CREATE TABLE

My table has just three simple columns and does not contain anything special. Here is the layout of the table:

view_demo=# \d t_product
  Table "public.t_product"
 Column |     Type      | Collation | Nullable | Default
--------+---------------+-----------+----------+---------------------------------------
 id     | integer       |           | not null | nextval('t_product_id_seq'::regclass)
 name   | text          |           |          |
 price  | numeric(16,4) |           |          |

 

Making changes to tables and views

The first thing to do in order to get our demo going is to create a view:

view_demo=# CREATE VIEW v AS SELECT * FROM t_product;
CREATE VIEW

The important thing to see here is how PostgreSQL handles the view. In the following listing you can see that the view definition does not contain a “*” anymore: PostgreSQL has silently replaced the “*” with the actual column list. This is important because it has serious implications:

view_demo=# \d+ v
  View "public.v"
 Column | Type          | Collation | Nullable | Default | Storage  | Description
--------+---------------+-----------+----------+---------+----------+-------------
 id     | integer       |           |          |         | plain    |
 name   | text          |           |          |         | extended |
 price  | numeric(16,4) |           |          |         | main     |
View definition:
  SELECT t_product.id,
         t_product.name,
         t_product.price
  FROM t_product;

What happens if we simply try to rename the table the view is based on:

view_demo=# ALTER TABLE t_product RENAME TO t_cool_product;
ALTER TABLE

view_demo=# \d+ v
View "public.v"
 Column | Type          | Collation | Nullable | Default | Storage  | Description
--------+---------------+-----------+----------+---------+----------+-------------
 id     | integer       |           |          |         | plain    |
 name   | text          |           |          |         | extended |
 price  | numeric(16,4) |           |          |         | main     |
View definition:
  SELECT t_cool_product.id,
         t_cool_product.name,
         t_cool_product.price
  FROM t_cool_product;

As you can see the view is changed as well. The reason is simple: PostgreSQL does not store the view as a string. Instead it keeps a binary copy of the definition around, which is largely based on object ids. The beauty is that if the name of a table or a column changes, those objects still have the same object id and therefore there is no problem for the view. The view will not break, become invalid, or face deletion.

The same happens when you change the name of a column:

view_demo=# ALTER TABLE t_cool_product
RENAME COLUMN price TO produce_price;
ALTER TABLE

Again the view will not be harmed:

view_demo=# \d+ v
  View "public.v"
 Column | Type          | Collation | Nullable | Default | Storage  | Description
--------+---------------+-----------+----------+---------+----------+-------------
 id     | integer       |           |          |         | plain    |
 name   | text          |           |          |         | extended |
 price  | numeric(16,4) |           |          |         | main     |
View definition:
   SELECT t_cool_product.id,
          t_cool_product.name,
          t_cool_product.produce_price AS price
   FROM t_cool_product;

What is really important and noteworthy here is that the view does not change its output: the columns provided by the view stay the same. In other words, applications relying on the view won’t break just because some other column has changed somewhere.

What PostgreSQL does behind the scenes

Behind the scenes a view is handled by the rewrite system. In the system catalog there is a table called pg_rewrite, which will store a binary representation of the view:

view_demo=# \d pg_rewrite
  Table "pg_catalog.pg_rewrite"
 Column     | Type         | Collation | Nullable | Default
------------+--------------+-----------+----------+---------
 rulename   | name         |           | not null |
 ev_class   | oid          |           | not null |
 ev_type    | "char"       |           | not null |
 ev_enabled | "char"       |           | not null |
 is_instead | boolean      |           | not null |
 ev_qual    | pg_node_tree |           |          |
 ev_action  | pg_node_tree |           |          |
Indexes:
  "pg_rewrite_oid_index" UNIQUE, btree (oid)
  "pg_rewrite_rel_rulename_index" UNIQUE, btree (ev_class, rulename)

Basically this is an internal thing. However, I decided to show how it works behind the scenes as it might be interesting to know.

Views and dropping columns

However, in some cases PostgreSQL has to error out. Suppose somebody wants to drop a column on which a view depends. In this case PostgreSQL has to error out because it cannot silently delete the column from the view.

view_demo=# ALTER TABLE t_cool_product DROP COLUMN name;
ERROR: cannot drop table t_cool_product column name because other objects depend on it
DETAIL: view v depends on table t_cool_product column name
HINT: Use DROP ... CASCADE to drop the dependent objects too.

In this case PostgreSQL complains that the view cannot be kept around because columns are missing. You can now decide whether to not drop the column or whether to drop the view along with the column.
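If the decision is to let the view go, the HINT shows the way; a sketch of what following it could look like (it really does drop the dependent view v along with the column, so double-check first):

-- Dropping the column with CASCADE also drops the dependent view v
ALTER TABLE t_cool_product DROP COLUMN name CASCADE;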

The post CREATE VIEW vs. ALTER TABLE in PostgreSQL appeared first on Cybertec.


Michael Paquier: TAP tests and external modules


Over the last couple of years PostgreSQL TAP testing has become an advanced facility, gaining more and more features which allow for complicated regression test scenarios written in Perl, with Perl 5.8.0 being the minimum requirement. As of now, TAP tests are divided into different pieces in the code tree:

  • src/test/perl/ contains the core set of modules. PostgresNode.pm is usually of main interest as it allows you to set up nodes, run psql on them, take base backups, etc. perldoc can also be used on them to get documentation about all the existing facilities.
  • Each binary in src/bin/ has its own set of tests, like initdb, pg_basebackup, etc. Those are the oldest ones, which have been introduced at the same time as the first Postgres TAP modules as of version 9.4.
  • src/test/recovery, set of tests mainly for replication and recovery. This is also one ancestor of most of the others since the introduction of PostgresNode.pm.
  • src/test/ssl, set of tests for OpenSSL, which is present for a couple of releases now.
  • src/test/authentication, introduced in Postgres 10, which has tests paths for SASLprep in SCRAM authentication.
  • src/test/subscription, for logical replication, and present since 10.
  • src/test/kerberos, set of tests for krb5, new as of 11.
  • src/test/ldap, which has tests for OpenLDAP, new as of 11.

Note that some of those tests are not designed to run in shared environments, as they need to run Postgres while listening on hosts available to all; this is the case for the SSL, Kerberos and LDAP tests. In v11 an environment variable called PG_TEST_EXTRA has been added to run those tests, so it can be used to automate all test runs in a consistent way:

PG_TEST_EXTRA='ssl ldap kerberos'

All basic TAP modules are installed with each PostgreSQL installation as part of lib/pgxs/src/test/perl/, so it is possible to include and run TAP tests within custom modules. One thing to know first is that the makefile of the module needs to be updated. In the most simplified form, if one wishes to keep make targets consistent with upstream for regression tests, one could just use this:

check:
        $(prove_check)
installcheck:
        $(prove_installcheck)

Also, when setting up a PostgreSQL cluster, the pg_regress command is a necessary requirement, as it is used to set up nodes in a way that avoids the problems described by CVE-2014-0067. When trying to use PostgresNode.pm directly you will see failures, so tests need to be made aware of the location of the command and set the environment variable PG_REGRESS for the run. One trick that I have found handy here is to use pg_config --libdir to find the base location and then register the location before initializing nodes, like this for example:

use strict;
use warnings;
# Modules the snippet below relies on: File::Temp and IPC::Run for running
# the command, TestLib for slurp_file, Test::More for ok().
use File::Temp;
use IPC::Run;
use PostgresNode;
use TestLib;
use Test::More;

# Run a simple command and grab its stdout output into a result
# given back to caller.
sub run_simple_command
{
    my ($cmd, $test_name) = @_;
    my $stdoutfile = File::Temp->new();
    my $stderrfile = File::Temp->new();
    my $result = IPC::Run::run $cmd, '>', $stdoutfile, '2>', $stderrfile;
    my $stdout = slurp_file($stdoutfile);

    ok($result, $test_name);
    chomp($stdout);
    return $stdout;
}

# Look at the binary position of pg_config and enforce the
# position of pg_regress to what is installed.
my $stdout = run_simple_command(['pg_config', '--libdir'],
    "fetch library directory using pg_config");
print "LIBDIR path found as $stdout\n";
$ENV{PG_REGRESS} = "$stdout/pgxs/src/test/regress/pg_regress";

prove_installcheck could be made smarter here, but there is nothing to prevent the integration of TAP tests even in externally-maintained modules. So happy test-hacking.

Achilleas Mantzios: PostgreSQL Audit Logging Best Practices


In every IT system where important business tasks take place, it is important to have an explicit set of policies and practices, and to make sure those are respected and followed.

Introduction to Auditing

An Information Technology system audit is the examination of the policies, processes, procedures, and practices of an organization regarding IT infrastructure against a certain set of objectives. An IT audit may be of two generic types:

  • Checking against a set of standards on a limited subset of data
  • Checking the whole system

An IT audit may cover certain critical system parts, such as the ones related to financial data in order to support a specific set of regulations (e.g. SOX), or the entire security infrastructure against regulations such as the new EU GDPR regulation which addresses the need for protecting privacy and sets the guidelines for personal data management. The SOX example is of the former type described above whereas GDPR is of the latter.

The Audit Lifecycle

Planning

The scope of an audit is dependent on the audit objective. The scope may cover a special application identified by a specific business activity, such as a financial activity, or the whole IT infrastructure covering system security, data security and so forth. The scope must be correctly identified beforehand as an early step in the initial planning phase. The organization is supposed to provide to the auditor all the necessary background information to help with planning the audit. This may be the functional/technical specifications, system architecture diagrams or any other information requested.

Control Objectives

Based on the scope, the auditor forms a set of control objectives to be tested by the audit. Those control objectives are implemented via management practices that are supposed to be in place in order to achieve control to the extent described by the scope. The control objectives are associated with test plans and those together constitute the audit program. Based on the audit program the organization under audit allocates resources to facilitate the auditor.

Findings

The auditor tries to get evidence that all control objectives are met. If for some control objective there is no such evidence, first the auditor tries to see if there is some alternative way that the company handles the specific control objective, and in case such a way exists then this control objective is marked as compensating and the auditor considers that the objective is met. If however there is no evidence at all that an objective is met, then this is marked as a finding. Each finding consists of the condition, criteria, cause, effect and recommendation. The IT manager must be in close contact with the auditor in order to be informed of all potential findings and make sure that all requested information are shared between the management and the auditor in order to assure that the control objective is met (and thus avoid the finding).

The Assessment Report

At the end of the audit process the auditor will write an assessment report as a summary covering all important parts of the audit, including any potential findings followed by a statement on whether the objective is adequately addressed and recommendations for eliminating the impact of the findings.

What is Audit Logging and Why Should You Do It?

The auditor wants to have full access to the changes on software, data and the security system. He/she not only wants to be able to track down any change to the business data, but also track changes to the organizational chart, the security policy, the definition of roles/groups and changes to role/group membership. The most common way to perform an audit is via logging. Although it was possible in the past to pass an IT audit without log files, today it is the preferred (if not the only) way.

Typically the average IT system comprises at least two layers:

  • Database
  • Application (possibly on top of an application server)

The application maintains its own logs covering user access and actions, and the database and possibly the application server systems maintain their own logs. Clean, readily usable information in log files which has real business value from the auditor perspective is called an audit trail. Audit trails differ from ordinary log files (sometimes called native logs) in that:

  • Log files are dispensable
  • Audit trails should be kept for longer periods
  • Log files add overhead to the system’s resources
  • Log files’ purpose is to help the system admin
  • Audit trails’ purpose is to help the auditor

We summarise the above in the following table:

Log type        | App/System | Audit Trail friendly
App logs        | App        | Yes
App server logs | System     | No
Database logs   | System     | No

App logs may be easily tailored to be used as audit trails. System logs not so easily because:

  • They are limited in their format by the system software
  • They act globally on the whole system
  • They don’t have direct knowledge about specific business context
  • They usually require additional software for later offline parsing/processing in order to produce usable audit-friendly audit trails.

On the other hand, app logs place an additional software layer on top of the actual data, thus:

  • Making the audit system more vulnerable to application bugs/misconfiguration
  • Creating a potential hole in the logging process if someone tries to access data directly on the database bypassing the app logging system, such as a privileged user or a DBA
  • Making the audit system more complex and harder to manage and maintain in case we have many applications or many software teams.

So, ideally we would be looking for the best of the two: Having usable audit trails with the greatest coverage on the whole system including database layer, and configurable in one place, so that the logging itself can be easily audited by means of other (system) logs.

Audit Logging with PostgreSQL

The options we have in PostgreSQL regarding audit logging essentially boil down to exhaustive statement logging, trigger-based solutions such as the community audit trigger, and the pgaudit extension.

Exhaustive logging, at least for standard usage in OLTP or OLAP workloads, should be avoided because it:

  • Produces huge files, increases load
  • Does not have inner knowledge of tables being accessed or modified, just prints the statement which might be a DO block with a cryptic concatenated statement
  • Needs additional software/resources for offline parsing and processing (in order to produce the audit trails) which in turn must be included in the scope of the audit, to be considered trustworthy
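For reference, exhaustive statement logging is just a configuration change; a minimal sketch of switching it on cluster-wide (requires superuser, and be aware it logs every single statement):

-- Log every statement the server executes (heavy; use with care)
ALTER SYSTEM SET log_statement = 'all';
SELECT pg_reload_conf();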

In the rest of this article we will try the tools provided by the community. Let’s suppose that we have this simple table that we want to audit:

myshop=# \d orders
                                       Table "public.orders"
   Column   |           Type           | Collation | Nullable |              Default               
------------+--------------------------+-----------+----------+------------------------------------
 id         | integer                  |           | not null | nextval('orders_id_seq'::regclass)
 customerid | integer                  |           | not null |
 customer   | text                     |           | not null |
 xtime      | timestamp with time zone   |           | not null | now()
 productid  | integer                  |           | not null |
 product    | text                     |           | not null |
 quantity   | integer                  |           | not null |
 unit_price | double precision         |           | not null |
 cur        | character varying(20)    |           | not null | 'EUR'::character varying
Indexes:
    "orders_pkey" PRIMARY KEY, btree (id)

audit-trigger 91plus

The docs about using the trigger can be found here: https://wiki.postgresql.org/wiki/Audit_trigger_91plus. First we download and install the provided DDL (functions, schema):

$ wget https://raw.githubusercontent.com/2ndQuadrant/audit-trigger/master/audit.sql
$ psql myshop
psql (10.3 (Debian 10.3-1.pgdg80+1))
Type "help" for help.
myshop=# \i audit.sql

Then we define the triggers for our table orders using the basic usage:

myshop=# SELECT audit.audit_table('orders');

This will create two triggers on table orders: an insert_update_delete row trigger and a truncate statement trigger. Now let’s see what the triggers do:

myshop=# insert into orders (customer,customerid,product,productid,unit_price,quantity) VALUES('magicbattler',1,'some fn skin 2',2,5,2);      
INSERT 0 1
myshop=# update orders set quantity=3 where id=2;
UPDATE 1
myshop=# delete from orders  where id=2;
DELETE 1
myshop=# select table_name, action, session_user_name, action_tstamp_clk, row_data, changed_fields from audit.logged_actions;
-[ RECORD 1 ]-----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
table_name        | orders
action            | I
session_user_name | postgres
action_tstamp_clk | 2018-05-20 00:15:10.887268+03
row_data          | "id"=>"2", "cur"=>"EUR", "xtime"=>"2018-05-20 00:15:10.883801+03", "product"=>"some fn skin 2", "customer"=>"magicbattler", "quantity"=>"2", "productid"=>"2", "customerid"=>"1", "unit_price"=>"5"
changed_fields    |
-[ RECORD 2 ]-----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
table_name        | orders
action            | U
session_user_name | postgres
action_tstamp_clk | 2018-05-20 00:16:12.829065+03
row_data          | "id"=>"2", "cur"=>"EUR", "xtime"=>"2018-05-20 00:15:10.883801+03", "product"=>"some fn skin 2", "customer"=>"magicbattler", "quantity"=>"2", "productid"=>"2", "customerid"=>"1", "unit_price"=>"5"
changed_fields    | "quantity"=>"3"
-[ RECORD 3 ]-----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
table_name        | orders
action            | D
session_user_name | postgres
action_tstamp_clk | 2018-05-20 00:16:24.944117+03
row_data          | "id"=>"2", "cur"=>"EUR", "xtime"=>"2018-05-20 00:15:10.883801+03", "product"=>"some fn skin 2", "customer"=>"magicbattler", "quantity"=>"3", "productid"=>"2", "customerid"=>"1", "unit_price"=>"5"
changed_fields    |

Note the changed_fields value on the Update (RECORD 2). There are more advanced uses of the audit trigger, like excluding columns, or using the WHEN clause as shown in the doc. The audit trigger sure seems to do the job of creating useful audit trails inside the audit.logged_actions table. However there are some caveats:

  • No SELECTs (triggers do not fire on SELECTs) or DDL are tracked
  • Changes by table owners and superusers can easily be tampered with
  • Best practices must be followed regarding the app user(s) and app schema and tables owners

Pgaudit

Pgaudit is the newest addition to PostgreSQL as far as auditing is concerned. Pgaudit must be installed as an extension, as shown in the project’s github page: https://github.com/pgaudit/pgaudit. Pgaudit logs in the standard PostgreSQL log. Pgaudit works by registering itself upon module load and providing hooks for the executorStart, executorCheckPerms, processUtility and object_access. Therefore pgaudit (in contrast to trigger-based solutions such as audit-trigger discussed in the previous paragraphs) supports READs (SELECT, COPY). Generally with pgaudit we can have two modes of operation or use them combined:

  • SESSION audit logging
  • OBJECT audit logging

Session audit logging supports most DML, DDL, privilege and misc commands via classes:

  • READ (select, copy from)
  • WRITE (insert, update, delete, truncate, copy to)
  • FUNCTION (function calls and DO blocks)
  • ROLE (grant, revoke, create/alter/drop role)
  • DDL (all DDL except those in ROLE)
  • MISC (discard, fetch, checkpoint, vacuum)

The metaclass “all” includes all classes, while prefixing a class with “-” excludes it. For instance, let us configure Session audit logging for all classes except MISC, with the following GUC parameters in postgresql.conf:

pgaudit.log_catalog = off
pgaudit.log = 'all, -misc'
pgaudit.log_relation = 'on'
pgaudit.log_parameter = 'on'

By giving the following commands (the same as in the trigger example)

myshop=# insert into orders (customer,customerid,product,productid,unit_price,quantity) VALUES('magicbattler',1,'some fn skin 2',2,5,2);
INSERT 0 1
myshop=# update orders set quantity=3 where id=2;
UPDATE 1
myshop=# delete from orders  where id=2;
DELETE 1
myshop=#

We get the following entries in PostgreSQL log:

% tail -f data/log/postgresql-22.log | grep AUDIT:
[local] [55035] 5b03e693.d6fb 2018-05-22 12:46:37.352 EEST psql postgres@testdb line:7 LOG:  AUDIT: SESSION,5,1,WRITE,INSERT,TABLE,public.orders,"insert into orders (customer,customerid,product,productid,unit_price,quantity) VALUES('magicbattler',1,'some fn skin 2',2,5,2);",<none>
[local] [55035] 5b03e693.d6fb 2018-05-22 12:46:50.120 EEST psql postgres@testdb line:8 LOG:  AUDIT: SESSION,6,1,WRITE,UPDATE,TABLE,public.orders,update orders set quantity=3 where id=2;,<none>
[local] [55035] 5b03e693.d6fb 2018-05-22 12:46:59.888 EEST psql postgres@testdb line:9 LOG:  AUDIT: SESSION,7,1,WRITE,DELETE,TABLE,public.orders,delete from orders  where id=2;,<none>

Note that the text after AUDIT: makes up a perfect audit trail, almost ready to ship to the auditor in spreadsheet-ready CSV format. Using session audit logging will give us audit log entries for all operations belonging to the classes defined by the pgaudit.log parameter, on all tables. However, there are cases where we wish only a small subset of the data, i.e. only a few tables, to be audited. In such cases we may prefer object audit logging, which gives us fine-grained criteria for selected tables/columns via PostgreSQL's privilege system. In order to start using object audit logging we must first configure the pgaudit.role parameter, which defines the master role that pgaudit will use. It makes sense not to give this user any login rights.

CREATE ROLE auditor;
ALTER ROLE auditor WITH NOSUPERUSER INHERIT NOCREATEROLE NOCREATEDB NOLOGIN NOREPLICATION NOBYPASSRLS CONNECTION LIMIT 0;

Then we specify this value for pgaudit.role in postgresql.conf:

pgaudit.log = none # no need for extensive SESSION logging
pgaudit.role = auditor

Pgaudit OBJECT logging works by checking whether the user auditor is granted (directly or by inheritance) the right to execute the specified action on the relations/columns used in a statement. So if we need to ignore all tables, but want detailed logging for table orders, this is the way to do it:

grant ALL on orders to auditor;

By the above grant we enable full SELECT, INSERT, UPDATE and DELETE logging on table orders. Let’s give once again the INSERT, UPDATE, DELETE of the previous examples and watch the postgresql log:

% tail -f data/log/postgresql-22.log | grep AUDIT:
[local] [60683] 5b040125.ed0b 2018-05-22 14:41:41.989 EEST psql postgres@testdb line:7 LOG:  AUDIT: OBJECT,2,1,WRITE,INSERT,TABLE,public.orders,"insert into orders (customer,customerid,product,productid,unit_price,quantity) VALUES('magicbattler',1,'some fn skin 2',2,5,2);",<none>
[local] [60683] 5b040125.ed0b 2018-05-22 14:41:52.269 EEST psql postgres@testdb line:8 LOG:  AUDIT: OBJECT,3,1,WRITE,UPDATE,TABLE,public.orders,update orders set quantity=3 where id=2;,<none>
[local] [60683] 5b040125.ed0b 2018-05-22 14:42:03.148 EEST psql postgres@testdb line:9 LOG:  AUDIT: OBJECT,4,1,WRITE,DELETE,TABLE,public.orders,delete from orders  where id=2;,<none>

We observe that the output is identical to the SESSION logging discussed above with the difference that instead of SESSION as audit type (the string next to AUDIT: ) now we get OBJECT.
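
Since OBJECT logging follows the privilege system, the audit scope can be narrowed even further with column-level grants. A minimal sketch (the column choice here is purely illustrative):

-- audit only reads of the customer column and writes to the quantity column
REVOKE ALL ON orders FROM auditor;
GRANT SELECT (customer), UPDATE (quantity) ON orders TO auditor;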

One caveat with OBJECT logging is that TRUNCATEs are not logged; we have to resort to SESSION logging for those. But in that case we end up getting all WRITE activity for all tables. There is talk among the hackers involved of making each command a separate class.

Another thing to keep in mind is that, in the case of inheritance, if we GRANT access to the auditor on some child table and not the parent, actions on the parent table which translate to actions on rows of the child table will not be logged.

In addition to the above, the IT people in charge of the integrity of the logs must document a strict and well-defined procedure which covers the extraction of the audit trail from the PostgreSQL log files. Those logs might be streamed to an external secure syslog server in order to minimize the chances of any interference or tampering.

Joshua Drake: PostgresConf US 2018, Financial Review

The following table contains a summary profit and loss statement for PostgresConf US 2018.



In review of these numbers two things will probably jump out at you:
  1. Venue and F&B of 238,000.12
  2. Net Revenue of 202,201.62
Yes, we spent almost 250,000.00 dollars on the venue and food and beverage. In fact, the Food and Beverage alone was over 135,000.00 dollars. 

We were fortunate to have very strong ticket sales as well as partner support through Sponsorships. This support will allow us to not only meet our financial requirements for PostgresConf Silicon Valley 2018 but will help us make our financial requirements for our European, Chinese, and US conferences in 2019. We are also hoping to set aside more money for our popular diversity and professional development initiatives.

The Chairs would like to thank all our organizers, volunteers, partners and attendees for helping us continue advocacy of People, Postgres, Data!


Amit Kapila: Parallel Index Scans In PostgreSQL


There is a lot to say about parallelism in PostgreSQL. We have come a long way since I wrote my first post on this topic (Parallel Sequential Scans). Each of the past three releases (including PG-11, which is in beta) has had parallel query as a major feature, which in itself says how useful this feature is and how much work is being done on it. You can read more about parallel query from the PostgreSQL docs or from a blog post on this topic by my colleague Robert Haas. The intent of this blog post is to talk about parallel index scans, which were released in PostgreSQL 10. Currently, parallel scans are supported only for btree indexes.

To demonstrate how the feature works, here is an example of TPC-H Q-6 at scale factor - 20 (which means approximately 20GB database). Q6 is a forecasting revenue change query. This query quantifies the amount of revenue increase that would have resulted from eliminating certain company-wide discounts in a given percentage range in a given year. Asking this type of "what if" query can be used to look for ways to increase revenues.

explain analyze
select sum(l_extendedprice * l_discount) as revenue
          from lineitem
          where l_shipdate >= date '1994-01-01' and
          l_shipdate < date '1994-01-01' + interval '1' year and
          l_discount between 0.02 - 0.01 and 0.02 + 0.01 and
          l_quantity < 24
          LIMIT 1;

Non-parallel version of plan
-------------------------------------
Limit
-> Aggregate
    -> Index Scan using idx_lineitem_shipdate on lineitem
         Index Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01  
         00:00:00'::timestamp without time zone) AND (l_discount >= 0.01) AND
         (l_discount <= 0.03)  AND  (l_quantity < '24'::numeric))
Planning Time: 0.406 ms
Execution Time: 35073.886 ms

Parallel version of plan
-------------------------------
Limit
-> Finalize Aggregate
    -> Gather
         Workers Planned: 2
         Workers Launched: 2
          -> Partial Aggregate
               -> Parallel Index Scan using idx_lineitem_shipdate on lineitem
                    Index Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01 
                    00:00:00'::timestamp without time zone) AND (l_discount >= 0.01) AND
                    (l_discount <= 0.03) AND (l_quantity < '24'::numeric))
Planning Time: 0.420 ms
Execution Time: 15545.794 ms

We can see that the execution time is reduced by more than half for a parallel plan with two parallel workers. This query filters many rows and the work (CPU time) to perform that is divided among workers (and leader), leading to reduced time.

To further see the impact with the number of workers, we have used a somewhat bigger dataset (scale_factor = 50). The setup has been done using a TPC-H-like benchmark for PostgreSQL. We have also created a few additional indexes on the columns l_shipmode, l_shipdate, o_orderdate and o_comment.

Non-default parameter settings:
random_page_cost = seq_page_cost = 0.1
effective_cache_size = 10GB
shared_buffers = 8GB
work_mem = 1GB





The time is reduced almost linearly up to 8 workers and then it decreases more slowly. A further increase in workers won't help unless the amount of data to scan increases.

We have further evaluated the parallel index scan feature for all the queries in the TPC-H benchmark and found that it is used in a number of queries, with a positive impact (the execution time is reduced significantly). Below are results for TPC-H, scale factor 20, with the number of parallel workers set to 2. The X-axis indicates the query (1: Q6, 2: Q14, 3: Q18).


Under the Hood

The basic idea is quite similar to parallel heap scans, where each worker (including the leader whenever possible) scans a block (all the tuples in a block) and then gets the next block that needs to be scanned. The parallelism is implemented at the leaf level of the btree. The first worker to start a btree scan descends to the leaf level, while the others wait until it has reached a leaf. Once the first worker has read the leaf block, it sets the next block to be read and wakes one of the waiting workers, then proceeds to scan tuples from the block it has read. From then on, each worker, after reading a block, sets the next block to be read and wakes up the next waiting worker. This continues until no more pages are left to scan, at which point the parallel scan ends and all the workers are notified.

A new GUC, min_parallel_index_scan_size, has been introduced; it indicates the minimum amount of index data that must be scanned in order for a parallel scan to be considered. Users can try changing the value of this parameter to see if a parallel index plan is effective for their queries. The number of parallel workers is decided based on the number of index pages to be scanned. The final cost of the parallel plan assumes that the cost (CPU cost) to process the rows is divided equally among workers.
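
If the planner does not pick a parallel index scan where you expect one, these are the knobs worth checking; the values below are only examples, not recommendations:

SHOW min_parallel_index_scan_size;          -- 512kB by default
SET min_parallel_index_scan_size = '512kB';
SET max_parallel_workers_per_gather = 4;    -- upper bound on workers per Gather node
-- then re-run EXPLAIN (ANALYZE) on the query to see whether a Parallel Index Scan is chosen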

In the end, I would like to thank the people (Rahila Syed and Robert Haas) who were involved in this work (along with me) and my employer EnterpriseDB who has supported this work. I would also like to thank Rafia Sabih who helped me in doing performance testing for this blog.

Regina Obe: PostGIS 2.5.0alpha


The PostGIS development team is pleased to release PostGIS 2.5.0alpha.

This release is a work in progress, and some features are still slated to be added. Although this release will work for PostgreSQL 9.4 and above, to take full advantage of what PostGIS 2.5 will offer, you should be running PostgreSQL 11beta+.

Best served with PostgreSQL 11beta which was recently released.

View all closed tickets for 2.5.0.

After installing the binaries or after running pg_upgrade, make sure to do:

ALTER EXTENSION postgis UPDATE;

— if you use the other extensions packaged with postgis — make sure to upgrade those as well

ALTER EXTENSION postgis_sfcgal UPDATE;
ALTER EXTENSION postgis_topology UPDATE;
ALTER EXTENSION postgis_tiger_geocoder UPDATE;

If you use legacy.sql or legacy_minimal.sql, make sure to rerun the version packaged with these releases.

2.5.0alpha

Liaqat Andrabi: Introduction to Postgres-BDR [Webinar Follow-up]


The announcement of Postgres-BDR 3.0 last month at PostgresConf US had been long-awaited in the PostgreSQL community. Complex use cases from customers have driven the development of BDR far beyond the original feature set, resulting in a more robust technology than ever imagined – so we’re happy to say that it was worth the wait.

Postgres-BDR (Bi-Directional Replication) enhances PostgreSQL with advanced multi-master replication technology that can be used to implement very high availability applications.

For an introduction to Postgres-BDR covering an overview of its complex architecture and its common use cases – 2ndQuadrant held the “Introduction to Postgres-BDR” webinar as part of its PostgreSQL webinar series.

The webinar was presented by Simon Riggs, Founder and CEO of 2ndQuadrant, who is also one of the committers of the PostgreSQL project.  Those who weren’t able to attend the live event can now view the recording here.

Some additional Questions and Answers are shown here:

Q: Is it necessary for all nodes to replicate data to all other nodes?

A: BDR has a fully connected network of nodes based on mesh topology. All nodes replicate data to all other nodes. You can, however, choose a subset of tables to replicate within a node.

Q: Can I use a star topology instead of a network (all with all)?

A: BDR has a fully connected network of nodes based on mesh topology, which is the only topology currently supported.

Q: Are there any tools available that help you resolve conflict visually?

A: BDR resolves conflicts automatically. Also, it keeps a log of resolved conflicts as well as any unresolvable problems for later assessment. These conflicts are visible using GUI tools as well as command line access.

Q: Does Postgres-BDR have lock conflict management? If yes, how does it work between the different nodes?

A: It is a deliberate design choice by Postgres-BDR to not propagate lock information since it would cause an unacceptable performance loss across long distance network links, as occurs in more tightly coupled clustering solutions. This allows BDR nodes to operate independently apart from replicating data changes.

Q: Can you elaborate on backup/recovery options in BDR3?

A: You can get more detailed information by downloading the BDR3 white paper from this link.

For any questions or comments regarding Postgres-BDR, please send an email to info@2ndQuadrant.com. You can also check out past webinars from our PostgreSQL Webinar series here.

Sebastian Insausti: PostgreSQL Streaming Replication - a Deep Dive


Knowledge of high availability is a must for anybody managing PostgreSQL. It is a topic that we have seen over and over, but that never gets old. In this blog, we are going to review a little bit of the history of PostgreSQL built-in replication features and deep dive into how streaming replication works.

When talking about replication, we will be talking a lot about WALs. So, let's review a little bit what this is about.

Write Ahead Log (WAL)

Write Ahead Log is a standard method for ensuring data integrity, and it is enabled by default.

The WALs are the REDO logs in PostgreSQL. But, what are the REDO logs?

REDO logs contain all changes that were made in the database and they are used by replication, recovery, online backup and point in time recovery (PITR). Any changes that have not been applied to the data pages can be redone from the REDO logs.

Using WAL results in a significantly reduced number of disk writes, because only the log file needs to be flushed to disk to guarantee that a transaction is committed, rather than every data file changed by the transaction.

A WAL record will specify, bit by bit, the changes made to the data. Each WAL record will be appended into a WAL file. The insert position is a Log Sequence Number (LSN) that is a byte offset into the logs, increasing with each new record.

The WALs are stored in pg_xlog (or pg_wal in PostgreSQL 10) directory, under the data directory. These files have a default size of 16MB (the size can be changed by altering the --with-wal-segsize configure option when building the server). They have a unique incremental name, in the following format: "00000001 00000000 00000000".

The number of WAL files contained in pg_xlog (or pg_wal) will depend on the value assigned to the parameter checkpoint_segments (or min_wal_size and max_wal_size, depending on the version) in the postgresql.conf configuration file.

One parameter that we need to set up when configuring all our PostgreSQL installations is wal_level. It determines how much information is written to the WAL. The default value is minimal, which writes only the information needed to recover from a crash or immediate shutdown. archive adds the logging required for WAL archiving; hot_standby further adds the information required to run read-only queries on a standby server; and finally, logical adds the information necessary to support logical decoding. This parameter requires a restart, so it can be hard to change on running production databases if we have forgotten that.

For further information, you can check the official documentation here or here. Now that we’ve covered the WAL, let's review the replication history…

History of replication in PostgreSQL

The first replication method (warm standby) that PostgreSQL implemented (version 8.2, back in 2006) was based on the log shipping method.

This means that the WAL records are directly moved from one database server to another to be applied. We can say that it is a continuous PITR.

PostgreSQL implements file-based log shipping by transferring WAL records one file (WAL segment) at a time.

This replication implementation has the downside that if there is a major failure on the primary servers, transactions not yet shipped will be lost. So there is a window for data loss (you can tune this by using the archive_timeout parameter, which can be set to as low as a few seconds, but such a low setting will substantially increase the bandwidth required for file shipping).

We can represent this method with the picture below:

PostgreSQL file-based log shipping
PostgreSQL file-based log shipping

So, on version 9.0 (back in 2010), streaming replication was introduced.

This feature allowed us to stay more up-to-date than is possible with file-based log shipping, by transferring WAL records (a WAL file is composed of WAL records) on the fly (record based log shipping), between a master server and one or several slave servers, without waiting for the WAL file to be filled.

In practice, a process called WAL receiver, running on the slave server, will connect to the master server using a TCP/IP connection. On the master server another process exists, named WAL sender, which is in charge of sending the WAL records to the slave server as they happen.
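
As a rough sketch of what this looks like in practice on PostgreSQL 10 (the host addresses, replication user and password below are purely illustrative):

# primary: postgresql.conf
wal_level = replica
max_wal_senders = 5

# primary: pg_hba.conf - allow the standby to connect for replication
host  replication  repl  192.168.100.42/32  md5

# standby: recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=192.168.100.41 port=5432 user=repl password=secret'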

Streaming replication can be represented as follows:

PostgreSQL Streaming replication
PostgreSQL Streaming replication

By looking at the above diagram we can think, what happens when the communication between the WAL sender and the WAL receiver fails?

When configuring streaming replication, we have the option to enable WAL archiving.

This step is actually not mandatory, but it is extremely important for a robust replication setup, as it is necessary to prevent the main server from recycling old WAL files that have not yet been applied to the slave. If this occurs we will need to recreate the replica from scratch.

When configuring replication with continuous archiving (as explained here), we are starting from a backup and, to reach the in-sync state with the master, we need to apply all the changes hosted in the WAL that happened after the backup. During this process, the standby will first restore all the WAL available in the archive location (done by calling restore_command). The restore_command will fail when we reach the last archived WAL record, so after that, the standby is going to look in the pg_wal (pg_xlog) directory to see if the change exists there (this is actually done to avoid data loss when the master server crashes and some changes that have already been moved to the replica and applied there have not yet been archived).

If that fails, and the requested record does not exist there, then it will start communicating with the master through streaming replication.

Whenever streaming replication fails, it will go back to step 1 and restore the records from archive again. This loop of retries from the archive, pg_wal, and via streaming replication goes on until the server is stopped or failover is triggered by a trigger file.

This will be a diagram of such configuration:

PostgreSQL streaming replication with continuous archiving
PostgreSQL streaming replication with continuous archiving

Streaming replication is asynchronous by default, so at some given moment we can have some transactions that can be committed in the master and not yet replicated into the standby server. This implies some potential data loss.

However this delay between the commit and impact of the changes in the replica is supposed to be really small (some milliseconds), assuming of course that the replica server is powerful enough to keep up with the load.

For the cases when even the risk of a small data loss is not tolerable, version 9.1 introduced the synchronous replication feature.

In synchronous replication each commit of a write transaction will wait until confirmation is received that the commit has been written to the write-ahead log on disk of both the primary and standby server.

This method minimizes the possibility of data loss; for that to happen we would need both the master and the standby to fail at the same time.

The obvious downside of this configuration is that the response time for each write transaction increases, as we need to wait until all parties have responded. So the time for a commit is, at minimum, the round trip between the master and the replica. Read-only transactions will not be affected by this.

To set up synchronous replication we need, for each of the standby servers, to specify an application_name in the primary_conninfo of the recovery.conf file: primary_conninfo = '... application_name=slaveX'.

We also need to specify the list of standby servers that are going to take part in the synchronous replication: synchronous_standby_names = 'slaveX,slaveY'.

We can set up one or several synchronous servers, and this parameter also specifies which method (FIRST or ANY) is used to choose synchronous standbys from the listed ones. For more information on how to set up this replication mode please refer here. It is also possible to set up synchronous replication when deploying via ClusterControl, from version 1.6.1 (which was released at the time of writing).
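
Putting those two settings together, a minimal synchronous setup could look like this (the standby names are illustrative):

# standby: recovery.conf
primary_conninfo = 'host=primary_host user=repl application_name=slaveX'

# primary: postgresql.conf
synchronous_standby_names = 'FIRST 1 (slaveX, slaveY)'   # or simply 'slaveX,slaveY'
synchronous_commit = on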

After we have configured our replication, and it is up and running, we will need to have some monitoring over it.

Monitoring PostgreSQL Replication

The pg_stat_replication view on the master server has a lot of relevant information:

postgres=# SELECT * FROM pg_stat_replication;
 pid | usesysid | usename | application_name |  client_addr   | client_hostname | client_port |         backend_start         | backend_xmin |   state   | sent_lsn  | write_lsn | 
flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state 
-----+----------+---------+------------------+----------------+-----------------+-------------+-------------------------------+--------------+-----------+-----------+-----------+-
----------+------------+-----------+-----------+------------+---------------+------------
 994 |    16467 | repl    | walreceiver      | 192.168.100.42 |                 |       37646 | 2018-05-24 21:27:57.256242-03 |              | streaming | 0/50002C8 | 0/50002C8 | 
0/50002C8 | 0/50002C8  |           |           |            |             0 | async
(1 row)

Let's see this in detail:

pid: Process ID of the WAL sender process.
usesysid: OID of the user used for streaming replication.
usename: Name of the user used for streaming replication.
application_name: Application name of the connection to the master.
client_addr: Address of the standby / streaming replication client.
client_hostname: Hostname of the standby.
client_port: TCP port number on which the standby communicates with the WAL sender.
backend_start: Time at which the standby connected to the master.
state: Current WAL sender state, i.e. streaming.
sent_lsn: Last transaction location sent to the standby.
write_lsn: Last transaction written to disk at the standby.
flush_lsn: Last transaction flushed to disk at the standby.
replay_lsn: Last transaction replayed at the standby.
sync_priority: Priority of the standby server for being chosen as the synchronous standby.
sync_state: Sync state of the standby (async or sync).

We can also see the WAL sender/receiver processes running on the servers.

Sender (Primary Node):

[root@postgres1 ~]# ps aux |grep postgres
postgres   833  0.0  1.6 392032 16532 ?        Ss   21:25   0:00 /usr/pgsql-10/bin/postmaster -D /var/lib/pgsql/10/data/
postgres   847  0.0  0.1 244844  1900 ?        Ss   21:25   0:00 postgres: logger process   
postgres   850  0.0  0.3 392032  3696 ?        Ss   21:25   0:00 postgres: checkpointer process   
postgres   851  0.0  0.3 392032  3180 ?        Ss   21:25   0:00 postgres: writer process   
postgres   852  0.0  0.6 392032  6340 ?        Ss   21:25   0:00 postgres: wal writer process   
postgres   853  0.0  0.3 392440  3052 ?        Ss   21:25   0:00 postgres: autovacuum launcher process   
postgres   854  0.0  0.2 247096  2172 ?        Ss   21:25   0:00 postgres: stats collector process   
postgres   855  0.0  0.2 392324  2504 ?        Ss   21:25   0:00 postgres: bgworker: logical replication launcher   
postgres   994  0.0  0.3 392440  3528 ?        Ss   21:27   0:00 postgres: wal sender process repl 192.168.100.42(37646) streaming 0/50002C8

Receiver (Standby Node):

[root@postgres2 ~]# ps aux |grep postgres
postgres   833  0.0  1.6 392032 16436 ?        Ss   21:27   0:00 /usr/pgsql-10/bin/postmaster -D /var/lib/pgsql/10/data/
postgres   848  0.0  0.1 244844  1908 ?        Ss   21:27   0:00 postgres: logger process   
postgres   849  0.0  0.2 392128  2580 ?        Ss   21:27   0:00 postgres: startup process   recovering 000000010000000000000005
postgres   851  0.0  0.3 392032  3472 ?        Ss   21:27   0:00 postgres: checkpointer process   
postgres   852  0.0  0.3 392032  3216 ?        Ss   21:27   0:00 postgres: writer process   
postgres   853  0.0  0.1 246964  1812 ?        Ss   21:27   0:00 postgres: stats collector process   
postgres   854  0.0  0.3 398860  3840 ?        Ss   21:27   0:05 postgres: wal receiver process   streaming 0/50002C8

One way of checking how up to date our replication is, is by checking the amount of WAL records generated on the primary but not yet applied on the standby.

Master:

postgres=# SELECT pg_current_wal_lsn();
pg_current_wal_lsn
--------------------
0/50002C8
(1 row)

Note: This function is for PostgreSQL 10. For previous versions, you need to use: SELECT pg_current_xlog_location();

Slave:

postgres=# SELECT pg_last_wal_receive_lsn();
pg_last_wal_receive_lsn
-------------------------
0/50002C8
(1 row)
postgres=# SELECT pg_last_wal_replay_lsn();
pg_last_wal_replay_lsn
------------------------
0/50002C8
(1 row)

Note: These functions are for PostgreSQL 10. For previous versions, you need to use: SELECT pg_last_xlog_receive_location(); and SELECT pg_last_xlog_replay_location();

We can use the following query to get the lag in seconds.

PostgreSQL 10:

SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
THEN 0
ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
END AS log_delay;

Previous Versions:

SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location()
THEN 0
ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
END AS log_delay;

Output:

postgres=# SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
postgres-# THEN 0
postgres-# ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
postgres-# END AS log_delay;
log_delay
-----------
        0
(1 row)

To deploy streaming replication setups (synchronous or asynchronous), we can use ClusterControl:

Deploying PostgreSQL replication setups
Deploying PostgreSQL replication setups

It also allows us to monitor the replication lag, as well as other key metrics.

PostgreSQL Overview
PostgreSQL Overview
PostgreSQL Topology View
PostgreSQL Topology View

As streaming replication is based on shipping the WAL records and having them applied on the standby server, it basically describes which bytes to add or change in which file. As a result, the standby server is actually a bit-by-bit copy of the master.

We have here some well known limitations:

  • We cannot replicate into a different version or architecture.
  • We cannot change anything on the standby server.
  • We do not have much granularity on what we can replicate.

So, for overcoming these limitations, PostgreSQL 10 has added support for logical replication.

Logical Replication

Logical replication will also use the information in the WAL file, but it will decode it into logical changes. Instead of knowing which byte has changed, we will know exactly what data has been inserted in which table.

It is based on a publish and subscribe model, with one or more subscribers subscribing to one or more publications on a publisher node, which looks like this:

PostgreSQL Logical Replication
PostgreSQL Logical Replication

To know more about Logical Replication in PostgreSQL, we can check the following blog.
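
Just to give a flavour of the model, a minimal sketch on PostgreSQL 10 would be the following (database, table and connection details are illustrative):

-- on the publisher
CREATE PUBLICATION my_pub FOR TABLE mytable;

-- on the subscriber (the table definition must already exist there)
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=publisher_host dbname=mydb user=repl'
    PUBLICATION my_pub;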

With this replication option there are many use cases that now become possible, like replicating only some of the tables or consolidating multiple databases into a single one.

What new features will come? We will need to stay tuned and check, but we hope that master-master built-in replication is not far away.


Hans-Juergen Schoenig: PostgreSQL: Detecting periods of activity in a timeseries


I have already written about timeseries and PostgreSQL in the past. However, recently I stumbled across an interesting problem, which caught my attention: sometimes you might want to find "periods" of activity in a timeseries. For example: When was a user active? Or when did we receive data? This blog post tries to give you some ideas and shows how you can actually approach this kind of problem.

Loading timeseries data into PostgreSQL

The next listing shows a little bit of sample data, which I used to write the SQL code you are about to see:

CREATE TABLE t_series (t date, data int);

COPY t_series FROM stdin DELIMITER ';';
2018-03-01;12
2018-03-02;43
2018-03-03;9
2018-03-04;13
2018-03-09;23
2018-03-10;26
2018-03-11;28
2018-03-14;21
2018-03-15;15
\.

For the sake of simplicity I just used two columns in my example. Note that my timeseries is not continuous but interrupted. There are three continuous periods in this set of data. Our goal is to find and isolate them to do analysis on each of those continuous periods.

PostgreSQL time series

Preparing for timeseries analysis

When dealing with timeseries one of the most important things to learn is how to “look forward and backward”. In most cases it is simply vital to compare the current line with the previous line. To do that in PostgreSQL (or in SQL in general) you can make use of the “lag” function:

test=# SELECT *, lag(t, 1) OVER (ORDER BY t)
       FROM t_series;
          t | data | lag
------------+------+----------
 2018-03-01 |   12 | 
 2018-03-02 |   43 | 2018-03-01
 2018-03-03 |    9 | 2018-03-02
 2018-03-04 |   13 | 2018-03-03
 2018-03-09 |   23 | 2018-03-04
 2018-03-10 |   26 | 2018-03-09
 2018-03-11 |   28 | 2018-03-10
 2018-03-14 |   21 | 2018-03-11
 2018-03-15 |   15 | 2018-03-14
(9 rows)

As you can see the last column contains the date of the previous row. Now: How does PostgreSQL know what the previous row actually is? The “ORDER BY”-clause will define exactly that.

Based on the query you have just seen, it is easy to calculate the size of the gap from one row to the next row:

test=# SELECT *, t - lag(t, 1) OVER (ORDER BY t) AS diff
       FROM t_series;
          t | data | diff 
------------+------+------
 2018-03-01 |   12 | 
 2018-03-02 |   43 | 1
 2018-03-03 |    9 | 1
 2018-03-04 |   13 | 1
 2018-03-09 |   23 | 5
 2018-03-10 |   26 | 1
 2018-03-11 |   28 | 1
 2018-03-14 |   21 | 3
 2018-03-15 |   15 | 1
(9 rows)

What we see now is the difference from one row to the next. That is pretty useful because we can start to create our rules. When do we consider a segment to be over, and how long a gap do we allow for before we consider it to be the next segment / period?

In my example I decided that every gap longer than 2 days should trigger the creation of a new segment (or period). The next challenge is therefore to assign numbers to the periods we are about to detect. Once this is done, we can easily aggregate on the result. The way I have decided to do this is by using the sum function. Remember: when NULL is fed to an aggregate, the aggregate will ignore the input; otherwise it will simply add up the input.

Here is the query:

test=# SELECT *, sum(CASE WHEN diff IS NULL 
                     OR diff > 2 THEN 1 ELSE NULL END) OVER (ORDER BY t) AS period
       FROM (SELECT *, t - lag(t, 1) OVER (ORDER BY t) AS diff
             FROM   t_series
       ) AS x;
          t | data | diff | period 
------------+------+------+--------
 2018-03-01 |   12 |      | 1
 2018-03-02 |   43 |    1 | 1
 2018-03-03 |    9 |    1 | 1
 2018-03-04 |   13 |    1 | 1
 2018-03-09 |   23 |    5 | 2
 2018-03-10 |   26 |    1 | 2
 2018-03-11 |   28 |    1 | 2
 2018-03-14 |   21 |    3 | 3
 2018-03-15 |   15 |    1 | 3
(9 rows)

As you can see the last column contains the period ID as generated by the sum function in our query. From now on the analysis will be pretty simple, as we can aggregate over this result using a subselect, as shown in the next statement:

test=# SELECT period, sum(data) 
       FROM (SELECT *, sum(CASE WHEN diff IS NULL 
                    OR diff > 2 THEN 1 ELSE NULL END) OVER (ORDER BY t) AS period
             FROM (SELECT *, t - lag(t, 1) OVER (ORDER BY t) AS diff
                   FROM t_series
                  ) AS x
       ) AS y
GROUP BY period 
ORDER BY period;
 period | sum 
--------+-----
      1 | 77
      2 | 77
      3 | 36
(3 rows)

The result displays the sum of all data for each period. Of course you can also do more complicated stuff. However, the important thing is to understand how you can actually detect various periods of continuous activity.

The post PostgreSQL: Detecting periods of activity in a timeseries appeared first on Cybertec.

Joshua Drake: PostgresConf Silicon Valley: October 15th and 16th 2018, CFP is now open!

PostgresConf, in partnership with Silicon Valley Postgres, is pleased to announce that the call for papers for PostgresConf Silicon Valley is open.


The inaugural PostgresConf Silicon Valley will be held October 15th - 16th, 2018 at the Hilton San Jose (300 Almaden Boulevard, San Jose, CA 95110).


This two day, three track conference is a perfect opportunity for users, developers, business analysts, and enthusiasts from Silicon Valley and San Francisco to amplify Postgres and participate in the Postgres community.

The Call for Papers for PostgresConf Silicon Valley can be found here

Call for papers will be open from May 23rd until August 15th. Speakers will be notified of acceptance/decline no later than August 20th.


Conference Schedule at a glance:
  • Monday, October 15th: Trainings and Data track
  • Tuesday, October 16th: Keynotes, Dev and Ops tracks

Partner Opportunities
PostgresConf Silicon Valley is supported by its generous sponsors:

  • Conference Sponsors: Amazon Web Services and Pivotal
  • Premiere Sponsors: Compose, 2ndQuadrant, Timescale, and Microsoft 

Please contact us if you are interested in becoming a partner!


About PostgresConf:
PostgresConf is a global nonprofit conference series with a focus on growing community through increased awareness and education of Postgres and related technologies. PostgresConf is known for its highly attended national conference held in Jersey City, New Jersey with the mission of:



Contact: siliconvalley@postgresconf.org

Brian Fehrle: Why is PostgreSQL Running Slow? Tips & Tricks to Get to the Source


As a PostgreSQL Database Administrator, there are the everyday expectations to check on backups, apply DDL changes, make sure the logs don't have any game breaking ERRORs, and answer panicked calls from developers whose reports are running twice as long as normal and who have a meeting in ten minutes.

Even with a good understanding of the health of managed databases, there will always be new cases and new issues popping up relating to performance and how the database “feels”. Whether it’s a panicked email, or an open ticket for “the database feels slow”, this common task can generally be followed with a few steps to check whether or not there is a problem with PostgreSQL, and what that problem may be.

This is by no means an exhaustive guide, nor do the steps need to be done in any specific order. Rather, it's a set of initial steps that can be taken to help find the common offenders quickly, as well as gain new insight as to what the issue may be. A developer may know how the application acts and responds, but the Database Administrator knows how the database acts and responds to the application, and together, the issue can be found.

NOTE: The queries to be executed should be done as a superuser, such as ‘postgres’ or any database user granted the superuser permissions. Limited users will either be denied or have data omitted.

Step 0 - Information Gathering

Get as much information as possible from whoever says the database seems slow; specific queries, applications connected, timeframes of the performance slowness, etc. The more information they give the easier it will be to find the issue.

Step 1 - Check pg_stat_activity

The request may come in many different forms, but if “slowness” is the general issue, checking pg_stat_activity is the first step to understand just what’s going on. The pg_stat_activity view (documentation for every column in this view can be found here) contains a row for every server process / connection to the database from a client. There is a handful of useful information in this view that can help.

NOTE: pg_stat_activity has been known to change structure over time, refining the data it presents. Understanding of the columns themselves will help build queries dynamically as needed in the future.

Notable columns in pg_stat_activity are:

  1. query: a text column showing the query that’s currently being executed, waiting to be executed, or was last executed (depending on the state). This can help identify what query / queries a developer may be reporting are running slowly.
  2. client_addr: The IP address for which this connection and query originated from. If empty (or Null), it originated from localhost.
  3. backend_start, xact_start, query_start: These three provide a timestamp of when each started respectively. Backend_start represents when the connection to the database was established, xact_start is when the current transaction started, and query_start is when the current (or last) query started.
  4. state: The state of the connection to the database. Active means it’s currently executing a query, ‘idle’ means it’s waiting further input from the client, ‘idle in transaction’ means it’s waiting for further input from the client while holding an open transaction. (There are others, however their likelihood is rare, consult the documentation for more information).
  5. datname: The name of the database the connection is currently connected to. In multiple database clusters, this can help isolate problematic connections.
  6. wait_event_type and wait_event: These columns will be null when a query isn't waiting, but if it is waiting they will contain information on why the query is waiting, and exploring pg_locks can identify what it's waiting on. (PostgreSQL 9.5 and before only has a boolean column called 'waiting', true if waiting, false if not.)
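
A quick first pass over these columns (using the 9.6+ column names) can be as simple as the following; ordering by query_start puts the longest-running work at the top:

SELECT pid, datname, client_addr, state, wait_event_type, wait_event,
       now() - query_start AS query_runtime, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;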

1.1. Is the query waiting / blocked?

If there is a specific query or queries that are “slow” or “hung”, check to see if they are waiting for another query to complete. Due to relation locking, other queries can lock a table and not let any other queries to access or change data until that query or transaction is done.

PostgreSQL 9.5 and earlier:

SELECT * FROM pg_stat_activity WHERE waiting = TRUE;

PostgreSQL 9.6:

SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL;

PostgreSQL 10 and later (?):

SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL AND backend_type = 'client backend';

The results of this query will show any connections currently waiting on another connection to release locks on a relation that is needed.

If the query is blocked by another connection, there are some ways to find out just what they are. In PostgreSQL 9.6 and later, the function pg_blocking_pids() allows the input of a process ID that’s being blocked, and it will return an array of process ID’s that are responsible for blocking it.

PostgreSQL 9.6 and later:

SELECT * FROM pg_stat_activity 
WHERE pid IN (SELECT pg_blocking_pids(<pid of blocked query>));

PostgreSQL 9.5 and earlier:

SELECT blocked_locks.pid     AS blocked_pid,
         blocked_activity.usename  AS blocked_user,
         blocking_locks.pid     AS blocking_pid,
         blocking_activity.usename AS blocking_user,
         blocked_activity.query    AS blocked_statement,
         blocking_activity.query   AS current_statement_in_blocking_process
   FROM  pg_catalog.pg_locks         blocked_locks
    JOIN pg_catalog.pg_stat_activity blocked_activity  ON blocked_activity.pid = blocked_locks.pid
    JOIN pg_catalog.pg_locks         blocking_locks 
        ON blocking_locks.locktype = blocked_locks.locktype
        AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
        AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
        AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
        AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
        AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
        AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
        AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
        AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
        AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
        AND blocking_locks.pid != blocked_locks.pid
    JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
   WHERE NOT blocked_locks.GRANTED;

(Available from the PostgreSQL Wiki).

These queries will point to whatever is blocking a specific PID that’s provided. With that, a decision can be made to kill the blocking query or connection, or let it run.

Step 2 - If the queries are running, why are they taking so long?

2.1. Is the planner running queries efficiently?

If a query (or set of queries) in question has the status of ‘active’, then it’s actually running. If the whole query isn’t available in pg_stat_activity, fetch it from the developers or the postgresql log and start exploring the query planner.

EXPLAIN SELECT * FROM postgres_stats.table_stats t JOIN hosts h ON (t.host_id = h.host_id) WHERE logged_date >= '2018-02-01' AND logged_date < '2018-02-04' AND t.india_romeo = 569;
Nested Loop  (cost=0.280..1328182.030 rows=2127135 width=335)
  ->  Index Scan using six on victor_oscar echo  (cost=0.280..8.290 rows=1 width=71)
          Index Cond: (india_romeo = 569)
  ->  Append  (cost=0.000..1306902.390 rows=2127135 width=264)
        ->  Seq Scan on india_echo romeo  (cost=0.000..0.000 rows=1 width=264)
                Filter: ((logged_date >= '2018-02-01'::timestamp with time zone) AND (logged_date < '2018-02-04'::timestamp with time zone) AND (india_romeo = 569))
        ->  Seq Scan on juliet victor_echo  (cost=0.000..437153.700 rows=711789 width=264)
                Filter: ((logged_date >= '2018-02-01'::timestamp with time zone) AND (logged_date < '2018-02-04'::timestamp with time zone) AND (india_romeo = 569))
        ->  Seq Scan on india_papa quebec_bravo  (cost=0.000..434936.960 rows=700197 width=264)
                Filter: ((logged_date >= '2018-02-01'::timestamp with time zone) AND (logged_date < '2018-02-04'::timestamp with time zone) AND (india_romeo = 569))
        ->  Seq Scan on two oscar  (cost=0.000..434811.720 rows=715148 width=264)
                Filter: ((logged_date >= '2018-02-01'::timestamp with time zone) AND (logged_date < '2018-02-04'::timestamp with time zone) AND (india_romeo = 569))

This example shows a query plan for a two table join that also hits a partitioned table. We’re looking for anything that can cause the query to be slow, and in this case the planner is doing several Sequential Scans on partitions, suggesting that they are missing indexes. Adding indexes to these tables for column ‘india_romeo’ will instantly improve this query.

Things to look for are sequential scans, nested loops, expensive sorting, etc. Understanding the query planner is crucial to making sure queries are performing the best way possible, official documentation can be read for more information here.

2.2. Are the tables involved bloated?

If the queries are still feeling slow without the query planner pointing at anything obvious, it’s time to check the health of the tables involved. Are they too big? Are they bloated?

SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relname = 'mytable';
n_live_tup  | n_dead_tup
------------+------------
      15677 |    8275431
(1 row)

Here we see that there are many times more dead rows than live rows, which means to find the correct rows, the engine must sift through data that’s not even relevant to find real data. A vacuum / vacuum full on this table will increase performance significantly.
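
For example (VACUUM FULL rewrites the table under an exclusive lock, so it is usually reserved for a maintenance window):

VACUUM (VERBOSE, ANALYZE) mytable;   -- reclaim dead tuples and refresh planner statistics
-- VACUUM FULL mytable;              -- rewrites the table, exclusive lock, use with care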

Step 3 - Check the logs

If the issue still can’t be found, check the logs for any clues.

FATAL / ERROR messages:

Look for messages that may be causing issues, such as deadlocks or long wait times to gain a lock.

Checkpoints

Hopefully log_checkpoints is set to on, which will write checkpoint information to the logs. There are two types of checkpoints, timed and requested (forced). If checkpoints are being forced, then dirty buffers in memory must be written to disk before processing more queries, which can give a database system an overall feeling of “slowness”. Increasing checkpoint_segments or max_wal_size (depending on the database version) will give the checkpointer more room to work with, as well as help the background writer take some of the writing load.
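
The relevant settings look roughly like this (PostgreSQL 9.5+ parameter names; the values are only a starting point, not a recommendation):

log_checkpoints = on                 # log every checkpoint with timing and cause
max_wal_size = 4GB                   # more WAL headroom between timed checkpoints
checkpoint_completion_target = 0.9   # spread checkpoint writes over more of the interval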

Step 4 - What’s the health of the host system?

If there are no clues in the database itself, perhaps the host itself is overloaded or having issues. Anything from an overloaded IO channel to disk, memory overflowing to swap, or even a failing drive: none of these issues would be apparent with anything we looked at before. Assuming the database is running on a *nix based operating system, here are a few things that can help.

4.1. System load

Using ‘top’, look at the load average for the host. If the number is approaching or exceeding the number of cores on the system, it could simply be too many concurrent connections hitting the database, bringing it to a crawl as it tries to catch up.

load average: 3.43, 5.25, 4.85

4.2. System memory and SWAP

Using ‘free’, check to see if SWAP has been used at all. Memory overflowing to SWAP in a PostgreSQL database environment is extremely bad for performance, and many DBAs will even eliminate SWAP from database hosts, as an ‘out of memory’ error is preferable to a sluggish system.

If SWAP is being used, a reboot of the system will clear it out, and increasing total system memory or re-configuring memory usage for PostgreSQL (such as lowering shared_buffers or work_mem) may be in order.

[postgres@livedb1 ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7986         225        1297          12        6462        7473
Swap:          7987        2048        5939

4.3. Disk access

PostgreSQL attempts to do a lot of its work in memory, and spread out writing to disk to minimize bottlenecks, but on an overloaded system with heavy writing, it’s easily possible to see heavy reads and writes cause the whole system to slow as it catches up on the demands. Faster disks, more disks and IO channels are some ways to increase the amount of work that can be done.

Tools like ‘iostat’ or ‘iotop’ can help pinpoint if there is a disk bottleneck, and where it may be coming from.
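
For example (run on the database host; iotop typically requires root):

$ iostat -xm 5        # per-device utilization and throughput, refreshed every 5 seconds
$ iotop -o            # show only the processes currently doing I/O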

4.4. Check the logs

If all else fails, or even if not, the logs should always be checked to see if the system is reporting anything that's not right. We already discussed checking the PostgreSQL logs, but the system logs can give information about issues such as failing disks, failing memory, network problems, etc. Any one of these issues can cause the database to act slow and unpredictable, so a good understanding of perfect health can help find these issues.


Step 5 - Does something still not make sense?

Even the most seasoned administrators will run into something new that doesn't make sense. That's where the global PostgreSQL community can come in to help. Much like Step 0, the clearer the information given to the community, the easier it is for them to help out.

5.1. PostgreSQL Mailing Lists

Since PostgreSQL is developed and managed by the open source community, there are thousands of people who talk through the mailing lists to discuss countless topics including features, errors, and performance issues. The mailing lists can be found here, with pgsql-admin and pgsql-performance being the most important for looking for help with performance issues.

5.2. IRC

Freenode hosts several PostgreSQL channels with developers and administrators all over the world, and it’s not hard to find a helpful person to track down where issues may be coming from. More information can be found on the PostgreSQL IRC page.

Luca Ferrari: Statements with RETURNING: Perl and Java clients


PostgreSQL statements support the RETURNING predicate, which allows a statement that manipulates tuples to return a set of columns of those tuples. It is easy to use such statements on the client side to get back data that was not available when the query was written.

Statements with RETURNING: Perl and Java clients

Statements such as INSERT, DELETE and UPDATE can have a RETURNING predicate that allows you to get back the data that the statement has manipulated. From a theoretical point of view, it is as if the following two statements were executed:

INSERT|UPDATE|DELETE tuples;
SELECT above_tuples;

From within the database connection, such a RETURNING statement can be very useful to see which tuples have been modified, and from a client perspective it can be used to get back random and serial-based data. Consider a simple table defined as follows:

CREATE TABLE foo (pk serial, rv float);

and consider the following simple statement to insert values:

INSERT INTO foo(rv) SELECT random() FROM generate_series(1, 10);

The above query inserts 10 tuples with rv set to a random value and pk set to the next value of the associated sequence. In other words, it is not possible to know in advance what values have been inserted.

Thanks to RETURNING this knowledge is pushed back to the client, and it can be consumed as a normal result set, that is, as if the client had issued a SELECT statement.
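
In plain SQL, the same insert with RETURNING pushes the generated keys and random values straight back to the client; a minimal sketch:

INSERT INTO foo(rv)
SELECT random() FROM generate_series(1, 10)
RETURNING pk, rv;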

As a simple example, consider the following Perl client:

Craig Kerstiens: Fun with SQL: Window functions in Postgres


Today we continue to explore all the powerful and fun things you can do with SQL. SQL is a very expressive language and when it comes to analyzing your data there isn’t a better option. You can see the evidence of SQL’s power in all the attempts made by NoSQL databases to recreate the capabilities of SQL. So why not just start with a SQL database that scales? (Like my favorites, Postgres and Citus.)

Today, in the latest post in our ‘Fun with SQL’ series (earlier blog posts were about recursive CTEs, generate_series, and relocating shards on a Citus database cluster), we’re going to look at window functions in PostgreSQL. Window functions are key in various analytic and reporting use cases where you want to compare and contrast data. Window functions allow you to compare values between rows that are somehow related to the current row. Some practical uses of window functions can be:

  • Finding the first time all users performed some action
  • Finding how much each user's bill increased or decreased from the previous month
  • Finding where all users ranked for some sub-grouping

The basic structure of a window function in Postgres

Window functions within PostgreSQL have a built in set of operators and perform their action across some specific key. But they can have two different syntaxes that express the same thing. Let’s take a look at a simple window function expressed two different ways:

The first format

SELECT last_name, salary, department,
       rank() OVER (PARTITION BY department ORDER BY salary DESC)
FROM employees;

The second format

SELECT last_name, salary, department,
       rank() OVER w
FROM employees
WINDOW w AS (PARTITION BY department ORDER BY salary DESC);

With the first query we can see the window function is inlined, whereas in the second it is broken out separately. Both of the above queries produce the same results:

 last_name | salary  |  department  | rank 
-----------+---------+--------------+-------
 Jones     |   45000 | Accounting   |     1
 Williams  |   37000 | Accounting   |     2
 Smith     |   55000 | Sales        |     1
 Adams     |   50000 | Sales        |     2
 Johnson   |   40000 | Marketing    |     1

Both of these show the last name of employees, their salary, their department—and then rank where they fall in terms of salary in their department. You could easily combine this with a CTE to then find only the highest paying (where rank = 1) or second highest paying (where rank = 2) in each department.

What can you do with window functions in Postgres?

Within Postgres there are a number of window functions that each perform a different operation. You can check the PostgreSQL docs for the full list, but for now we’ll walk through a few that are particularly interesting:

  • rank - As we saw in the earlier example, rank will show where the row ranks in order of the window order.
  • percent_rank - Want to compute the percent where the row falls within your window order? percent_rank will give you the percentage ranking based on your window; think of it as ((rank - 1) / (total rows - 1))
  • lag - Want to do your own operation between rows? lag will give you the row value x rows before your current row. Want the value for future rows? You can use lead for that. A great example of this could be computing month over month growth (see the sketch after this list)
  • ntile - Want to compute what percentile values fall in? ntile allows you to specify how many buckets to group values into. For quartiles you would use ntile(4); for the percentile of each row you would use ntile(100).
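
As a sketch of the month-over-month idea mentioned above (the monthly_bills table and its columns are hypothetical):

SELECT user_id, month, amount,
       amount - lag(amount) OVER (PARTITION BY user_id ORDER BY month) AS change_vs_prev_month
FROM monthly_bills
ORDER BY user_id, month;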

Hopefully you'll find window functions as useful as we do here at Citus. If you have questions on using them, the PostgreSQL docs are a great resource, or feel free to jump into our Slack channel.
