Channel: Planet PostgreSQL

Dimitri Fontaine: The Art of PostgreSQL: The Transcript, part I


This article is a transcript of the talk I gave at Postgres Open 2019, titled the same as the book: The Art of PostgreSQL. It's available as a video on YouTube if you want to watch the slides and listen to it, and it even has subtitles!

Some people still prefer to read the text, so here it is.


Jonathan Katz: Just Upgrade: How PostgreSQL 12 Can Improve Your Performance


PostgreSQL 12, the latest version of the "world's most advanced open source relational database," is being released in the next few weeks, barring any setbacks. This follows the project's cadence of providing a raft of new database features once a year, which is, quite frankly, amazing and one of the reasons why I wanted to be involved in the PostgreSQL community.

In my opinion, and this is a departure from previous years, PostgreSQL 12 does not contain one or two single features that everyone can point to and say that "this is the 'FEATURE' release," (partitioning and query parallelism are recent examples that spring to mind). I've half-joked that the theme of this release should be "PostgreSQL 12: Now More Stable" -- which of course is not a bad thing when you are managing mission critical data for your business.

And yet, I believe this release is a lot more than that: many of the features and enhancements in PostgreSQL 12 will just make your applications run better without doing any work other than upgrading!

(...and maybe rebuild your indexes, which, thanks to this release, is not as painful as it used to be)!

It can be quite nice to upgrade PostgreSQL and see noticeable improvements without having to do anything other than the upgrade itself. A few years back when I was analyzing an upgrade of PostgreSQL 9.4 to PostgreSQL 10, I measured that my underlying application was performing much more quickly: it took advantage of the query parallelism improvements introduced in PostgreSQL 10. Getting these improvements took almost no effort on my part (in this case, I set the max_parallel_workers config parameter).
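
For reference, this is roughly the kind of change involved. It is a minimal sketch rather than tuning advice: the values are arbitrary examples, and max_parallel_workers_per_gather is included only because it usually goes hand in hand with max_parallel_workers.

ALTER SYSTEM SET max_parallel_workers = 8;              -- total parallel workers the server may use
ALTER SYSTEM SET max_parallel_workers_per_gather = 4;   -- workers a single query may use
SELECT pg_reload_conf();                                -- apply without a restart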

Having applications work better by simply upgrading is a delightful experience for users, and it's important that we keep our existing users happy as more and more people adopt PostgreSQL.

So, how can PostgreSQL 12 make your applications better just by upgrading? Read on!

Dimitri Fontaine: The Art of PostgreSQL: The Transcript, part II


This article is a transcript of the talk I gave at Postgres Open 2019, titled the same as the book: The Art of PostgreSQL. It's available as a video on YouTube if you want to watch the slides and listen to it, and it even has subtitles!

Some people still prefer to read the text, so here it is. This text is the second part of the transcript of the video. The first part is available at The Art of PostgreSQL: The Transcript, part I.

damien clochard: New version of PostgreSQL Anonymizer and more...


One year ago I started a side project called PostgreSQL Anonymizer to study and learn various ways to protect privacy using the power of PostgreSQL. The project is now part of the Dalibo Labs initiative and we published a new version last week…

This seems like a nice moment to analyze the progress we've made, how the GDPR is changing the game and where we're going…

GDPR: Sanctions are coming

While I was working on this, the landscape has changed… When the GDPR came into force in May 2018, one of the biggest questions was whether the fines would be significant enough to force a real change in corporate data policies…

From what we can see, the GDPR fines are starting to fall. In July 2019 alone, British Airways got a 204 M€ fine and Marriott Hotels got 110 M€.

There are also smaller fines for smaller companies. What's interesting is that the biggest fines are related to Article 32 and « insufficient technical and organisational measures to ensure information security ».

In other words: data leaks.

Here's where anonymization can help! Based on my experience, we can reduce the risk of leaking personal information by limiting the number of environments where the data is hosted. In many staging setups such as pre-production, training, development, CI, analytics, etc., the real data is not absolutely required. With a strong anonymization policy we can keep real data only where it is needed and work on fake/random data everywhere else. When anonymization is done the right way, the anonymized datasets are not subject to the GDPR.

In a nutshell, anonymization is a powerful method to reduce your attack surface and it is key to limiting the risk of GDPR penalties related to data leaks.

This is why we're investing a lot of effort to develop masking tools directly inside PostgreSQL!

Major Improvements

Over the last month, I've worked on different aspects of the extension, in particular:

Security Labels

One of the main drawbacks of the current implementation of PostgreSQL Anonymizer is that masking rules are declared using the COMMENT syntax, which can be annoying if your database already has comments.

Thanks to an idea from Alvaro Herrera, I'm currently working on a new declaration syntax based on Security Labels, a little-known feature also used by the sepgsql extension.

SECURITY LABEL FOR anon ON COLUMN people.zipcode
IS 'MASKED WITH FUNCTION anon.fake_zipcode()';

This should be available in a few weeks. Of course, the former syntax will still be supported for backward compatibility.
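
For comparison, here is a sketch of the existing COMMENT-based style described above, reusing the people.zipcode column and anon.fake_zipcode() function from the example:

-- Existing COMMENT-based declaration of the same masking rule
COMMENT ON COLUMN people.zipcode
IS 'MASKED WITH FUNCTION anon.fake_zipcode()';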

Let's talk!

GDPR and data privacy are two very hot topics! I will be talking about these subjects at various events in the forthcoming weeks:

If you have any ideas or comments on PostgreSQL Anonymizer, or more generally about protecting data privacy with Postgres, please send me a message at damien@dalibo.com.

John Naylor: PG12: A Few Special-Case Performance Enhancements

With every new release of PostgreSQL, there are a range of performance enhancements. Some are system-wide and affect every user, but most are highly specific to a certain use case. In this post, I am going to briefly highlight three improvements in PG12 that speed up certain operations. 1. Minimal decompression of TOAST values TOAST […]

Hans-Juergen Schoenig: Using “Row Level Security” to make large companies more secure


Large companies and professional businesses have to make sure that data is kept secure. It is necessary to defend against internal as well as external threats. PostgreSQL provides all the necessities a company needs to protect data and to ensure that people can only access what they are supposed to see. One way to protect data is "Row Level Security", which has been around for some years now. It can be used to reduce the scope of a user by removing rows from the result set automatically. Usually people apply simple policies to do that. But PostgreSQL Row Level Security (RLS) can do a lot more: you can actually control the way RLS behaves using configuration tables.

Configuring access restrictions dynamically

Imagine you are working for a large corporation. Your organization might change, people might move from one department to another, or new people might join as we speak. What you want is a security policy that always reflects the way your company really is. Let us take a look at a simple example:

CREATE TABLE t_company
(
     id            serial,
     department    text NOT NULL,
     manager       text NOT NULL
);

CREATE TABLE t_manager
(
     id      serial,
     person  text,
     manager text,
     UNIQUE (person, manager)
);

I have created two tables. One knows who is managing which department. The second table knows who reports to whom. The goal is to come up with a security policy which ensures that somebody can only see their own data or data from departments at lower levels. In many cases row level policies are hardcoded – in our case we want to be flexible and configure visibility based on the data in the tables.

Let us populate the tables:

INSERT INTO t_manager (person, manager)
VALUES ('hans', NULL),
       ('paula', 'hans'),
       ('berta', 'hans'),
       ('manuel', 'paula'),
       ('mike', 'paula'),
       ('joe', 'berta'),
       ('jack', 'berta'),
       ('jane', 'berta')
;

(Figure: the resulting manager hierarchy)

As you can see, "hans" has no manager. "paula" reports directly to "hans", "manuel" reports to "paula", and so on.

In the next step we can populate the company table:

INSERT INTO t_company (department, manager)
VALUES ('dep_1_1', 'joe'),
       ('dep_1_2', 'jane'),
       ('dep_1_3', 'jack'),
       ('dep_2_1', 'mike'),
       ('dep_2_2', 'manuel'),
       ('dep_1', 'berta'),
       ('dep_2', 'paula'),
       ('dep', 'hans')
;

For the sake of simplicity, I have named those departments in a way that they reflect the hierarchy in the company. The idea is to make the results easier to read and easier to understand. Of course, any other name will work just fine as well.

Defining row level policies in PostgreSQL

To enable row level security (RLS) you have to run ALTER TABLE … ENABLE ROW LEVEL SECURITY:

ALTER TABLE t_company ENABLE ROW LEVEL SECURITY;

What is going to happen is that users who are neither superusers nor marked BYPASSRLS won't see any data anymore. By default, PostgreSQL is restrictive and you have to define a policy to configure the desired scope for each user. The following policy uses a subselect to traverse our organization:


CREATE POLICY my_fancy_policy
  ON t_company
  USING (manager IN ( WITH RECURSIVE t AS 
                        (
                           SELECT current_user AS person, NULL::text AS manager
                           FROM t_manager
                           WHERE manager = CURRENT_USER
                           UNION ALL
                           SELECT m.person, m.manager
                           FROM t_manager m
                           INNER JOIN t ON t.person = m.manager
                        )
                        SELECT person FROM t
                    )
        )
;

What you can see here is that a policy can be pretty sophisticated. It is not just a simple expression but can even be a more complex subselect, which uses some configuration tables to decide what to do.

PostgreSQL row level security in action

Let us create a role now:

CREATE ROLE paula LOGIN;
GRANT ALL ON t_company TO paula;
GRANT ALL ON t_manager TO paula;

paula is allowed to log in and read all data in t_company and t_manager. Being able to read the table in the first place is a hard requirement to make PostgreSQL even consider your row level policy.

Once this is done, we can set the role to paula and see what happens:

test=> SET ROLE paula;
SET
test=> SELECT * FROM t_company;

id  | department | manager
----+------------+---------
4   | dep_2_1    | mike
5   | dep_2_2    | manuel
7   | dep_2      | paula
(3 rows)

As you can see, paula is only able to see her own department and the departments below her, which is exactly what we wanted to achieve.

Let us switch back to superuser now:

SET ROLE postgres;

We can try the same thing with a second user and we will again achieve the desired results:


CREATE ROLE hans LOGIN;

GRANT ALL ON t_company TO hans;
GRANT ALL ON t_manager TO hans;

The output is as expected:

test=# SET role hans;
SET
test=> SELECT * FROM t_company;
id  | department | manager
----+------------+---------
1   | dep_1_1    | joe
2   | dep_1_2    | jane
3   | dep_1_3    | jack
4   | dep_2_1    | mike
5   | dep_2_2    | manuel
6   | dep_1      | berta
7   | dep_2      | paula
8   | dep        | hans
(8 rows)

Row level security and performance

Keep in mind that a policy is basically a mandatory WHERE clause which is added to every query to ensure that the scope of a user is limited to the desired subset of data. The more expensive the policy is, the more impact it will have on performance. It is therefore highly recommended to think twice and to make sure that your policies are reasonably efficient to maintain good database performance.

The performance impact of row level security in PostgreSQL (or any other SQL database) cannot easily be quantified because it depends on too many factors. However, keep in mind – there is no such thing as a free lunch.
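
One easy way to see what a policy adds to your queries is to look at the execution plan while it applies. A quick sketch using the roles from this example (the exact plan depends on your data and PostgreSQL version):

SET ROLE paula;
EXPLAIN SELECT * FROM t_company;   -- the policy's subselect shows up as an extra filter in the plan
RESET ROLE;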

If you want to learn more about Row Level Security check out my post about PostgreSQL security.

The post Using “Row Level Security” to make large companies more secure appeared first on Cybertec.

Andrew Dunstan: Release 11 of the PostgreSQL Buildfarm client

Release 11 of the PostgreSQL Buildfarm client is now available. The release includes numerous bug fixes plus the following features: Allow a list of branches as positional arguments to run_branches.pl. This overrides what is found in the config file. The list can't include metabranches like ALL, nor can it contain regexes. Improve diagnostic capture for git […]

Dimitri Fontaine: The Art of PostgreSQL: The Transcript, part III


This article is a transcript of the talk I gave at Postgres Open 2019, titled the same as the book: The Art of PostgreSQL. It's available as a video on YouTube if you want to watch the slides and listen to it, and it even has subtitles!

Some people still prefer to read the text, so here it is. This text is the third part of the transcript of the video.

The first part is available at The Art of PostgreSQL: The Transcript, part I.

The second part is available at The Art of PostgreSQL: The Transcript, part II.


damien clochard: New version of PostgreSQL Anonymizer and more...


One year ago, I launched a project called PostgreSQL Anonymizer to study and learn different techniques for protecting private data using the power of PostgreSQL. The project is now part of the Dalibo Labs initiative and we published a new version last week…

This is a good opportunity to look at the progress we have made, the impact of the GDPR, and where the project is heading.

GDPR: the first sanctions are coming

Since the beginning of this project, the landscape has changed radically… When the GDPR came into force in May 2018, one of the big questions concerned the financial penalties: would the amounts be significant enough to cause a real change in the way companies handle personal data?

From what we can see, GDPR fines have started to fall since the beginning of the year. In July 2019 alone, British Airways received a 204 M€ penalty and Marriott Hotels received one of 110 M€.

There are also smaller fines for smaller companies. What is interesting is that the biggest penalties relate to Article 32 and « insufficient technical and organisational measures to ensure information security ».

In other words: data leaks.

This is an area where anonymization can help! Based on my experience, it is possible to reduce the risk of leaking sensitive information by limiting how widely that data is spread. In many environments (pre-production, training, development, CI, analytics, etc.) the real data is not absolutely necessary. With a strong anonymization policy, we can restrict the use of personal data to where it is truly needed and work with random or artificial data everywhere else. When anonymization is done properly, the anonymized datasets are no longer subject to the constraints of the GDPR.

In a nutshell: anonymization is a powerful method to reduce your attack surface and it is key to limiting the risks related to data leaks.

This is why we are investing a lot of effort in developing masking tools directly inside PostgreSQL!

Major improvements

Over the last month, I have worked on different aspects of the extension, in particular:

Security Labels

One drawback of the current implementation of PostgreSQL Anonymizer is that masking rules are declared with the COMMENT syntax, which can be annoying when your data model already contains comments.

Thanks to an idea from Alvaro Herrera, I am currently working on a new declaration syntax based on Security Labels, a little-known feature mainly used by the sepgsql extension.

SECURITY LABEL FOR anon ON COLUMN people.zipcode
IS 'MASKED WITH FUNCTION anon.fake_zipcode()';

This syntax should be available in a few weeks. Of course, the former syntax will still be supported for backward compatibility.

Let's talk!

GDPR and personal data protection are two very hot topics! I will be talking about all of this at various events this autumn, in particular:

If you have any ideas or comments on PostgreSQL Anonymizer, or more generally about protecting data privacy with Postgres, please send me a message at damien@dalibo.com.

Robert Haas: Synchronous Replication is a Trap

Almost ten years ago, I wrote a blog post -- cautiously titled What Kind of Replication Do You Need? -- in which I suggested that the answer was probably "asynchronous." At that time, synchronous replication was merely a proposed feature that did not exist in any official release of PostgreSQL; now, it's present in all supported versions and has benefited from several rounds of enhancements. Demand for this feature seems to be high, and there are numerous blog posts about it available (EnterpriseDB, Cybertec, Ashnik, OpsDash), but in my opinion, there is not nearly enough skepticism about the intrinsic value of the technology. I think that a lot of people are using this technology and getting little or no benefit out of it, and some are actively hurting themselves.

If you're thinking about deploying synchronous replication -- or really any technology -- you should start by thinking carefully about exactly what problem you are hoping to solve. Most people who are thinking about synchronous replication seem to be worried about data durability; that is, they want to minimize the chances that a transaction will be lost in the event of a temporary or permanent server loss. This is where I think most people hope for more than the feature is really capable of delivering; more on that below. However, some people are concerned with data consistency; that is, they want to make sure that if they update data on the master and then immediately query the data on a slave, the answer they get is guaranteed to reflect the update. At least one person with whom I spoke was concerned with replication lag; in that environment, the master could do more work than the standby could replay, and synchronous replication kept the two from diverging arbitrarily far from each other.

I have few reservations about the use of synchronous replication for data consistency.  For this to work, you need to configure the master with synchronous_commit = 'remote_apply' and set synchronous_standby_names to a value that will cause it to wait for all of the standbys to respond to every commit (see the documentation for details). You'll need PostgreSQL 9.6 or higher for this to work. Also, don't forget testing and monitoring. Remember that if one of your standbys goes down, commits will stall, and you'll need to update synchronous_standby_names to remove the failed standby (and reduce the number of servers for which you are to wait by one).  You can reverse those changes once the standby is back online and caught up with the master. Perhaps in the future we'll be able to do this sort of thing with less manual reconfiguration, but I think this kind of solution is already workable for many people. If the performance hit from enabling synchronous replication is acceptable to you, and if the benefit of not having to worry about the data on the standby being slightly stale is useful to you, this kind of configuration is definitely worth considering.
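
As a concrete sketch of such a consistency-oriented setup (the standby names and the count are placeholders, not a recommendation):

ALTER SYSTEM SET synchronous_commit = 'remote_apply';
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 2 (standby1, standby2)';  -- wait for both standbys
SELECT pg_reload_conf();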

I also don't think it's a big problem to use synchronous replication to control replication lag. It doesn't seem like an ideal tool, because if your goal is to prevent the standby from getting excessively far behind the master, you would probably be willing to accept a certain amount of lag when a burst of activity occurs on the master, as long as it doesn't continue for too long. Synchronous replication will not give you that kind of behavior.  It will wait at every commit for that commit to be received, written, or applied on the remote side (depending on the value you choose for synchronous_standby_names; see documentation link above). Perhaps someday we'll have a feature that slows down the master only when lag exceeds some threshold; that would be a nicer solution. In the meantime, using synchronous replication is a reasonable stopgap.

Where I think a lot of people go wrong is when they think about using synchronous replication for data durability. Reliable systems that don't lose data are built out of constituent parts that are unreliable and do lose data; durability and reliability are properties of the whole system, not a single component. When we say that a software solution such as synchronous replication improves data durability, what we really mean is that it helps you avoid the situation where you think the data is reliably persisted but it really isn't. After all, neither synchronous replication nor any other software system can prevent a disk from failing or a network connection from being severed; they can only change the way the system responds when such an event occurs.

The baseline expectation of a software system that talks to PostgreSQL is - or ought to be - that when you send a COMMIT command and the database confirms that the command was executed successfully, the changes made by that transaction have been persisted. In the default configuration, "persisted" means "persisted to the local disk." If you set synchronous_commit = off, you weaken that guarantee to "persisted in memory, and we'll get it on disk as soon as we can." If you set synchronous_standby_names, you strengthen it to "persisted to the local disk and also some other server's disk" - or perhaps multiple servers, depending on the value you configure. But the key thing here is that any configuration changes that you make in this area only affect the behavior of the COMMIT statement, and therefore they only have any value to the extent that your application pays attention to what happens when it runs COMMIT.
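
Summarizing the guarantees above as a small sketch (shown as session-level SET purely for illustration; what each value buys is in the comments):

SET synchronous_commit = off;            -- persisted in memory only; flushed to disk shortly after COMMIT returns
SET synchronous_commit = on;             -- the default: flushed to the local WAL (and to any synchronous standbys)
SET synchronous_commit = remote_apply;   -- additionally waits until synchronous standbys have applied the change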

To make this clearer, let's take an example. Suppose there's a human being - I'll call her Alice - who sits at a desk somewhere. People give orders (which are critical data!) to Alice, and she enters them into a web application, which stores the data into a PostgreSQL database. We have basically three components here: Alice, the web application, and the database. Any of them can fail, and nothing we do can prevent them from failing. The database server can go down due to a hardware or software fault; similarly for the web server; and Alice can get sick. We can reduce the probability of such failures by techniques such as RAID 10 and vaccinations, but we can't eliminate it entirely. What we can try to do is create a system that copes with such failures without losing any orders.

If there is a transient or permanent failure of Alice, it's probably safe to say that no orders will be lost. Maybe the people who normally give orders to Alice will leave them in a pile on her desk, or maybe they'll notice that Alice is out and come back later in the hopes of handing them to her directly once she's back in the office. Either way, the orders will eventually get entered. There are potential failure modes here, such as the building burning down and taking the unentered order papers with it, and there are things that can be done to mitigate such risks, but that's outside the scope of this blog post.

A transient or permanent failure of the web server is a more interesting case. A total failure of the web server is unlikely to cause any data loss, because Alice will be aware that the failure has occurred. If she goes to the web application where she spends most of her time and it fails to load, she'll hold on to any pending order papers at her desk until the application comes back up, and then deal with the backlog. Even if she's already got the web page loaded, she'll certainly notice if she hits the button to save the latest order and gets an error page back from the browser. Really, the only way things can go wrong here is if the web application experiences some kind of partial failure wherein it fails to save the order to the database but doesn't make it clear to Alice that something has gone wrong. In that case, Alice might discard the order paper on the erroneous belief that the data has been saved. But otherwise, we should be OK.

Notice that the key here is good error reporting: as long as Alice knows whether or not a particular transaction succeeded or failed, she'll know what to do.  Even if she's uncertain, that's OK: she can go search the database for the order that she just tried to enter, and see if it's there.  If it's not, she can enter it again before discarding the order paper.

Now, let's think about what happens if the database fails catastrophically, such that the entire server is lost. Obviously, if we don't have a replica, we will have lost data. If we do have a replica, we will probably have the data on the replica, and everything will be OK. However, if some orders were entered on the master but not yet replicated to the standby at the time of the failure, they might no longer exist on the new master. If Alice still has the order papers and hears about the failure, we're still OK: she can just reenter them. However, if she destroys every order paper as soon as the web application confirms that it's been saved, then we've got a problem.

This is where synchronous replication can help. If the database administrator enables synchronous replication, then the database server won't acknowledge a COMMIT from the web application until the commit has been replicated to the standby. If the application is well-designed, it won't tell Alice that the order is saved until the database acknowledges the COMMIT. Therefore, when Alice gets a message saying that the order has been saved, it's guaranteed to be saved on both the master and the standby, and she can destroy the original order paper with no risk of data loss - unless both the master and standby fail, but if you're worried about that scenario, you should have multiple standbys.  So, we seem to have constructed a pretty reliable system here, and synchronous replication is an important part of what makes it reliable.

Notice, however, the critical role of the application here. If the application tells Alice that the order is saved before the commit is acknowledged, then the whole thing falls apart. If Alice doesn't pay attention to whether or not the order was confirmed as saved, then the whole thing falls apart. She might, for example, close a frozen browser window and fail to recheck whether that order went through after reopening the application. Every step of the pipeline has to be diligent about reporting failures back to earlier stages, and there has to be a retry mechanism if things do fail, or if they may have failed. Remember, there's nothing at all we can do to remove the risk that the database server will fail, or that the web application will fail, or even that Alice will fail. What we can do is make sure that if the database fails, the web application knows about it; if the web application fails, Alice knows about it; and if Alice fails, the people submitting the orders know about it. That's how we create a reliable system.

So, the "trap" of synchronous replication is really that you might focus on a particular database feature and fail to see the whole picture. It's a useful tool that can supply a valuable guarantee for applications that are built carefully and need it, but a lot of applications probably don't report errors reliably enough, or retry transactions carefully enough, to get any benefit.  If you have an application that's not careful about such things, turning on synchronous replication may make you feel better about the possibility of data loss, but it won't actually do much to prevent you from losing data.

Pavel Stehule: pspg can be used like csv viewer

Last week I worked on a CSV parser and formatter. Now I have integrated both components into `pspg`, so `pspg` can be used as a CSV viewer.

cat obce.csv | pspg --csv

Ernst-Georg Schmid: cloudfs_fdw

Since I needed a Foreign Data Wrapper for files stored on S3, and the ones I found did things like loading the whole file in memory before sending the first rows, I wrote my own, using Multicorn.

Along the way, I discovered libraries like smart-open and ijson that allow streaming various file formats from various filesystems - and so this escalated a bit, into cloudfs_fdw.

It currently supports CSV and JSON files from S3, HTTP/HTTPS sources and local or network filesystems, but since smart-open supports more than that (e.g. HDFS, SSH), it can certainly be extended if needed.

For now, have fun.

Dave Conlin: Postgres Execution Plans — Field Glossary



I’ve talked in the past about how useful Postgres execution plans can be. They contain so much useful information that here at pgMustard we’ve built a whole tool to visualise and interpret them.

There are lots of guides out there to the basics of execution plans, but a lot of them are light on the details — how to interpret particular values, what they really mean, and where the pitfalls are.

We’ve spent a lot of time over the last 18 months learning, clarifying, and downright misinterpreting how each of these fields work — and there’s still further for us to go on that.

But we have come a long way, and I’d like to share the guide that I wish had existed when we started out — a glossary of the most common fields you’ll see on the operations in a query plan, and a detailed description of what each one means.

If you work with query plans often, hopefully you’ll still learn a thing or two by reading this guide start-to-finish, but for people who do less performance analysis, you may want to use it as a reference, to look up fields as and when you need them.

I’ve broken down this glossary into sections, based on which flag to EXPLAIN causes the field to be shown, to make it easier to find your way around (a combined example follows the list):

  • Query Structure Fields — always present.
  • Estimate Fields — present when the COSTS flag is set.
  • Actual Value Fields — present when the ANALYZE flag is set.
  • Buffers Fields — present when the BUFFERS flag is set.
  • Verbose Fields — present when the VERBOSE flag is set.
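
For reference, here is one way to request all of these at once. my_table is just a placeholder, and FORMAT JSON produces the field names used throughout this glossary:

EXPLAIN (ANALYZE, VERBOSE, COSTS, BUFFERS, FORMAT JSON)
SELECT * FROM my_table WHERE id = 1;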

Query Structure Fields

These fields represent what the plan will actually do: how the database will process the data and return the results for your query. When applicable, they’ll be present whatever flags you use to generate the query plan.

Node Type

The operation the node is performing. The best guides I’ve read on what different operation types actually do are Depesz’s series on different operations and the annotations to the code itself — both in the code for node execution and the planner nodes.

Plans

The child operations executed to provide input to this operation.

Parent Relationship

A guide as to why this operation needs to be run in order to facilitate the parent operation. There are six different possibilities:

  • Outer is the value you’ll see most often. It means “take in the rows from this operation as input, process them and pass them on”.
  • Inner is only ever seen on the second child of join operations, and is always seen there. This is the “inner” part of the loop, i.e. for each outer row, we look up its match using this operation.
  • Member is used for all children of append and modifyTable nodes, and on bitmap processing nodes like BitmapAnd and BitmapOr operations.
  • InitPlan: Used for calculations performed before the query can start, eg a constant referred to in the query or the result of a CTE scan.
  • Subquery: The child is a subquery of the parent operation. Since Postgres always uses subquery scans to feed subquery data to parent queries, it only ever appears on the children of subquery scans.
  • SubPlan: Like a Subquery, represents a new query, but used when a subquery scan is not necessary.

Filter

When present, this is a filter used to remove rows.

The important thing to note is that this is a filter in the traditional sense: these rows are read in (either from a data source or another operation in the plan), inspected, and then approved or removed based on the filter.

Although similar in purpose to the “Index Cond” that you see on Index Scans or Index Only Scans, the implementation is completely different. In an “Index Cond”, the index is used to select rows based on their indexed values, without examining the rows themselves at all. In fact, in some cases, you might see an “Index Cond” and a “Filter” on the same operation. You can read more about the difference between Index Conditions and Filters in my post about index efficiency (focusing on multi-column indexes).
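
As a small illustration with a hypothetical table (depending on statistics the planner may pick a different scan, but with an index scan you would see both clauses side by side):

CREATE TABLE orders (id bigint, customer_id int, status text);
CREATE INDEX ON orders (customer_id);

EXPLAIN
SELECT * FROM orders
WHERE customer_id = 42        -- answered from the index: shows up as "Index Cond"
  AND status = 'shipped';     -- checked against each fetched row: shows up as "Filter"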

Parallel Aware

Whether or not the operation will be run in a special mode to support parallelism. Some operations need to be aware that they are running in parallel, for example sequential scans need to know they only need to scan a smaller proportion of the table. Other operations can be run on several threads without each one having any knowledge of the others.

Relation Name

The data source being read/written from. Almost always a table name (including when the data is accessed via an index), but can also be a materialised view or foreign data source.

Alias

The alias used to refer to the Relation Name object.

Estimate Fields

These fields are added to nodes whenever the COSTS flag is set. It is on by default, but you can turn it off.

Total Cost

The estimated total cost of this operation and its descendants. The Postgres query planner often has several different ways it could resolve the same query. It calculates a cost — which is hopefully correlated with the amount of time taken — for each potential plan, and then picks the one with the smallest cost. It’s worth bearing in mind that the costs are unit-free — they’re not designed to convert into time, or disk reads. They’re just supposed to be bigger for slower operations, and smaller for faster ones. You can get a taste of what sort of things they consider by looking at the cost calculation code.

Startup Cost

The estimated amount of overhead necessary to start the operation. Note that, unlike “Actual Startup Time”, this is a fixed value, which won’t change for different numbers of rows.

Plan Rows

The number of rows the planner expects to be returned by the operation. This is a per-loop average, like “Actual Rows”.

Plan Width

The estimated average size of each row returned by the operation, in bytes.

Actual Value Fields

When you run EXPLAIN with the ANALYZE flag set, the query is actually executed — allowing real performance data to be gathered.

Actual Loops

The number of times the operation is executed. For a lot of operations it will have a value of one, but when it is not, there are three different cases:

  1. Some operations can be executed more than once. For example, “Nested Loops” run their “Inner” child once for every row returned by their “Outer” child.
  2. When an operation that would normally only consist of one loop is split across multiple threads, each partial operation is counted as a Loop.
  3. The number of loops can be zero when an operation doesn’t need to be executed at all. For example if a table read is planned to provide candidates for an inner join, but there turns out to be no rows on the other side of the join, the operation can be effectively eliminated.

Actual Total Time

The actual amount of time in milliseconds spent on this operation and all of its children. It’s a per-loop average, rounded to the nearest thousandth of a millisecond.

This can cause some odd occurrences, particularly for “Materialize” nodes. “Materialize” operations persist the data they receive in memory, to allow for multiple accesses, counting each access as a loop. They often get that data from a data read operation, which is only executed once, like in this example:

{
  "Node Type": "Materialize",
  "Actual Loops": 9902,
  "Actual Total Time": 0.000,
  "Plans": [{
    "Node Type": "Seq Scan",
    "Actual Loops": 1,
    "Actual Total Time": 0.035
  }]
}

As you can see, the per-loop “Actual Total Time” on the “Materialize” node has been rounded down to zero: all we know is that it is below 0.0005 ms, so the total time spent across all loops on the “Materialize” node and its child “Seq Scan” is somewhere below 9902 × 0.0005 = 4.951 ms.


So we’re in an odd situation where the “Materialize” node and its children seem to take 0ms in total, while it has a child that takes 0.035ms to execute, implying that the “Materialize” operation on its own somehow takes negative time!

Because the rounding is to such a high degree of accuracy, these issues usually occur only with very fast operations, rather than the slow ones that are often your focus. Still, in the future, I’d love to see query plans also include the “total total” (not just the per-loop total), to avoid any of these annoying niggles with rounding, which can be noticeable when the number of loops is high.

Actual Startup Time

When we started out, I thought that this was a constant for any operation, so if you halved the number of rows on a linear operation, then this would remain the same, while the “Actual Total Time” would decrease by half the difference between it and the “Actual Startup Time”.

In reality, this is the amount of time it takes to get the first row out of the operation. Sometimes, this is very close to the setup time — for example on a sequential scan which returns all the rows in a table.

Other times, though, it’s more or less the total time. For example, consider a sort of 10,000 rows. In order to return the first row, you have to sort all 10,000 rows to work out which one comes first, so the startup time will be almost equal to the total time, and will vary dramatically based on how many rows you have.

Actual Rows

The number of rows returned by the operation per loop.

The important thing to notice about this is that it’s an average of all the loops executed, rounded to the nearest integer. This means that while “Actual Loops” × “Actual Rows” is a pretty good approximation for total number of rows returned most of the time, it can be off by as much as half of the number of loops. Usually it’s just slightly confusing as you see a row or two appear or disappear between operations, but on operations with a large number of loops there is potential for serious miscalculation.

Rows Removed by Filter

The rows removed by the “Filter”, as described above. Like most of the other “Actual” values, this is a per-loop average, with all the confusion and loss of accuracy that entails.

Buffers Fields

These values describe the memory/disk usage of the operation. They’ll only appear when you generate a plan with the BUFFERS flag set, which defaults to off - although it’s so useful that there is a movement afoot to turn it on by default when ANALYZE is enabled.

Each of the ten buffer keys consists of two parts: a prefix describing the type of information being accessed, and a suffix describing how it has been read/written. Unlike the other actual values, they are total values, i.e. not per-loop, although they still include values for their child operations.

There are three prefixes:

  • Shared blocks contain data from normal tables and indexes.
  • Local blocks contain data from temporary tables and indexes (yes, this is quite confusing given that there is also a prefix “Temp”).
  • Temp blocks contain short-term data used to calculate hashes, sorts, Materialize operations, and similar cases.

There are four suffixes:

  • Hit means that the block was found in the cache, so no full read was necessary.
  • Read blocks were missed in the cache and had to be read from the normal source of the data.
  • Dirtied blocks have been modified by the query.
  • Written blocks have been evicted from the cache.

So for example “Shared Hit Blocks” is the number of blocks read from cached indexes/tables.

I/O Write Time, I/O Read Time

These are also obtainable through buffers if you set a config option to enable them. If you are a superuser, you can collect these values during the current session by executing SET track_io_timing = on;
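
In practice that looks something like this, with my_table as a placeholder (the setting normally requires superuser privileges):

SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM my_table;   -- the plan now includes I/O Read Time and I/O Write Time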

Verbose Fields

VERBOSE is a bit of a funny flag — it’s a grab-bag of different extra pieces of information that don’t necessarily have much in common.

Workers

A summary of the operation detail broken down by thread. Note that, like the “Workers Planned” and “Workers Launched” values for a parallelised operation, these entries don’t include the main thread (although you can work out the values for the main thread by subtracting the thread totals from the values of the field on the operation).

Output

The columns returned by the operation.

Schema

The schema of the “Relation Name” object.

Hopefully this has gone some way to explaining what those mysterious values on query plans really mean. There are plenty more operation-specific fields which you’ll see more or less often, but these are the core ones you’ll see every time you look at a plan.

Did I miss anything? Make any terrible mistakes? Anything you’d like to hear more about? Let me know, in the comments or at dave[at]pgmustard[dot]com!


Postgres Execution Plans — Field Glossary was originally published in pgMustard on Medium, where people are continuing the conversation by highlighting and responding to this story.

Leigh Halliday: Efficient GraphQL queries in Ruby on Rails & Postgres

GraphQL puts the user in control of their own destiny. Yes, they are confined to your schema, but beyond that they can access the data in any which way. Will they ask only for the "events", or also for the "category" of each event? We don't really know! In REST based APIs we know ahead of time what will be rendered, and can plan ahead by generating the required data efficiently, often by eager-loading the data we know we'll need. In this article, we will discuss what N+1 queries are, how…

Hubert 'depesz' Lubaczewski: How to run short ALTER TABLE without long locking concurrent queries


Claire Giordano: What DjangoCon has to do with Postgres and Crocodiles. An interview with Louise Grandjonc from Microsoft


When Django developer and Azure Postgres* engineer Louise Grandjonc confirmed that she could sit down with me for an interview in the days leading up to DjangoCon 2019, I jumped at the chance. Those of you who were in the room for Louise’s talk this week probably understand why. Louise explains technical topics in a way that makes sense—and she often uses unusual (and fun) examples, from crocodiles to owls, from Harry Potter to Taylor Swift.

And since I experience a bit of FOMO whenever I miss a fun developer conference like DjangoCon, I especially wanted to learn more about Louise’s DjangoCon talk: Postgres Index Types and where to find them.

Here’s an edited transcript of my interview with Louise Grandjonc of Microsoft (@louisemeta on Twitter.)

* Editor’s Note: Azure Database for PostgreSQL is a fully-managed database as a service. Now including Hyperscale (Citus) as a built-in deployment option that enables you to scale out Postgres horizontally on Azure.

What are the key takeaways for your DjangoCon talk on “Postgres Index Types and where to find them”?

Louise:

  1. The 1st thing: I want you to understand that Postgres indexes are useful for 2 reasons. Performance, and Constraints.

  2. The 2nd thing: I hope people walk away with a clear overview of all the different options you get when you create an index—including partial indexes, unique indexes (that force a constraint), multi-column indexes, as well as standard (no condition) indexes. I want to help Django developers understand when to pick which one, and when the different index options can help you.

  3. Third, I want you to understand the different types of most popular Postgres indexes and what differentiates them, including balanced trees (BTrees), GIN, GIST, and BRIN indexes.

Oh, and one more thing about my talk…. I always try to make sure that the slides of my talks are useful by themselves, even if you missed the talk. Or in case you get distracted during my talk, so it’s easy to catch up when you start listening again! Let me know if I succeeded. (Link to Louise’s DjangoCon 2019 slides.)

And I add in the crocodile drawings because, well, they are adorable and they help to keep my audience engaged. When I was naming my first talk on Postgres indexes, I had just read this cute story about birds eating food out of crocodile’s teeth and it also helps the crocodile. Whenever I give a very technical talk, I like to have a story, an example that attendees can follow.

(Crocodile drawings by https://www.instagram.com/zimmoriarty/)

You’re a Django developer and also a Postgres developer. What advice do you give to Django developers who want to learn more about Postgres?

Louise:

Well, if you have never learned anything about databases, it’s true that getting started with a database can be complex… and of course people don’t want to break anything. It’s easy to be scared of breaking things.

So first I empathize! I tell people that it’s already so much work to learn Python, and Django—so it can be overwhelming to have to learn SQL and databases, too.

And yet—if you’re curious, and interested in understanding more about Postgres, then you’ll be able to learn how to leverage Postgres and your ORM… And there are some great resources to help you.

What “getting started” resources for PostgreSQL do you recommend most often to Django developers?

Louise:

  1. There is a useful website called pgexercises.com. It’s a good resource for learning SQL—in a PostgreSQL environment. While it’s true that your Django ORM covers 90% of what Postgres can do in terms of SQL, you might not know about the other 10%. So many people don’t know what things Postgres can do that their ORM can’t. It’s so hard to know what you don’t know.

    So the pgexercises website is a good way to learn more about what PostgreSQL can do.

  2. Markus Winand wrote a book, SQL Performance Explained. I recommend it.

    “SQL Performance Explained” was one of the first books I read about Postgres. It was the first year after I graduated from university, after I’d been working with Postgres for a while and once I became more curious. Markus’s book helped me understand so so much about the Postgres database and what I was doing wrong. As well as what I was doing right.

    It’s a good book for Django developers.

    Also, it’s a short book. That’s a very important thing. Some books are so huge, you just know you’re not going to read the entire thing. This is especially true with technology books.

  3. Conferences! I have learned so much from attending conferences and having people share their knowledge with me. I really love how in the open source community—like Python, Django or Postgres—everyone seems eager to share their learnings.

Are there any tools in Django that are useful for learning how to optimize app performance by optimizing Postgres?

Louise:

Yes. Django includes the Django debug toolbar django-debug-toolbar. It does so many things—how about I focus on what the debug toolbar does for SQL and for SQL queries?

When you’re creating a view in Django, the Python code will retrieve the data that will be displayed in your template. The view will send the data and the page template will display the data. When you open your new endpoint (something like users/louisegrandjonc), the endpoint will display whatever you want, and the django-debug-toolbar will ADD something to that page, which will contain all the queries that have been executed when loading the page (as well as the EXPLAIN plan of the queries). It also gives you the code line that generated the query, which is amazing for debugging a loop that generates unfortunate queries, for example.

You mentioned how important it is to be willing to try things, “to play with your database.” What did you mean?

Louise:

My advice: as long as you’re working on a local database, on your own laptop—not in production—I tell developers to play with the database. What I mean is: try to change the configuration of your database, for example to be able to have logs. (Logs are so useful. django-debug-toolbar is a good tool, but it doesn’t do everything, and logs definitely augment django-debug-toolbar.) And also try to run SQL queries, write them, test them. Learning on a dataset that you know, on a project that you love will make it so much easier to assimilate new things!

When you’re working with a local database, when you’re not scared of screwing up your data, you can more easily test things to validate your understanding and learn how things work, experientially.

For example, I recommend you learn to use pg_stat_statements in Postgres—both in production, and also in testing—to find very slow queries.
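
For readers who haven't used it, here is a minimal sketch of that kind of query. Note that pg_stat_statements must be listed in shared_preload_libraries first, and the column names shown are those of Postgres 12 and earlier (they were renamed in later releases):

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY total_time DESC      -- the queries eating the most cumulative time
LIMIT 10;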

You told me Django is such an impressive Python development platform. Why?

Louise:

The first thing I have to say is that the Django community is really wonderful. Welcoming for newcomers. And diverse. The Django community has been working on diversity & inclusion for a long time. In fact, the first conference I ever presented at was DjangoCon Europe. I got the courage to submit because I knew it would be a good crowd to talk in front of, I knew the audience would be positive, supportive, open-minded, even though I was sure I would be a little bit stressed when the time came to give the talk.

Oh, and the Django documentation is really good: it’s easy to find the answers that you need. And the getting started documentation is also good for, well, for getting started. But I still use it when I forget things like, “what’s the command to create a new app again?”

The Django community is quite active online, for example on the mailing list, or on Stack Overflow where you’re likely to find answers to some of your Django questions, too.

And when it comes to Postgres, why is Django so impressive?

Louise:

Django supports a lot of data types that are specific to Postgres, such as JSONB, tsrange, arrays, hstore, and more. Plus Django supports Window functions, subqueries, aggregations… as well as the full range of indexes that you get out of the box with Postgres. And Django supports Postgres’s full text search feature—also PostGIS which is another popular Postgres extension used for geospatial.

And a lot of these features come from contributions from the community. Some started as applications that were then integrated into core. As part of the Postgres team at Microsoft, in addition to working on Azure Database for PostgreSQL, one of my Citus open source projects is a Django library called django-multitenant. Django-multitenant makes it easy for developers building multi-tenant apps to build their Django app on top of Citus, which is an extension to Postgres that scales out Postgres horizontally. That is just one example of an open source contribution from my own database team at Citus—there are so many valuable community contributions from others, too!

Someone recently told me that one of the reasons Django is so good with Postgres is because the Django creators were Postgres users. I didn’t know that. I’m told that they—the Django creators—built good compatibility with Postgres from scratch, from the beginning. And then of course the community contributions started.

Django is constantly evolving as developers contribute and add to it—so Django support for Postgres just keeps getting better.

So if I’m a developer and I’m using the Django ORM, why do I need to know SQL at all?

Louise:

One reason to know SQL and to learn a bit about Postgres is to optimize performance. So you can answer questions like: why is that query slow? If you don’t understand the query, you can’t really debug it, find the missing index, or a better way to write it.

The 2nd reason: even if Django has gone pretty far in what it supports for Postgres, there are still some features that are not supported by the Django ORM yet (such as GROUPING SETS, LATERAL JOIN, and Common Table Expressions—often called CTEs). Maybe you don’t need those features, or maybe you just don’t know about them. And from my experience, when I discover something new, I realize that I needed it all along and so I start using it!

The Django ORM is great and works really well with Postgres. And it’s also great to know that you can use extra features that will help you build a performant and beautiful app!

Thank you so much! I hope you have an awesome time at DjangoCon.

Louise:

Thank you. I’ve been working with Django the last 6 years and just love it. And I love the Postgres features that have rolled out in Django. It’s a very cool framework—and I’m so excited to be able to give a talk at DjangoCon on Postgres indexes this year.

Dimitri Fontaine: Why Postgres?



That’s a very popular question to ask these days, it seems. The quick answer is easy and is the slogan of PostgreSQL, as seen on the community website for it: “PostgreSQL: The World’s Most Advanced Open Source Relational Database”. What does that mean for you, the developer?

In my recent article The Art of PostgreSQL: The Transcript, part I you will read why I think it’s interesting to use Postgres in your application’s stack. My conference talk addresses the main area where I think many people get it wrong:

Postgres is an RDBMS

RDBMS are not a storage solution

Do not use Postgres to solve a storage problem!

Bruce Momjian: Implementing Transparent Data Encryption in Postgres


For the past 16 months, there has been discussion about whether and how to implement Transparent Data Encryption (TDE) in Postgres. Many other relational databases support TDE, and some security standards require it. However, it is also debatable how much security value TDE provides.

The TDE 400-email thread became difficult for people to follow, partly because full understanding required knowledge of Postgres internals and security details. A group of people who wanted to move forward began attending a Zoom call, hosted by Ahsan Hadi. The voice format allowed for more rapid exchange of ideas, and the ability to quickly fill knowledge gaps. It was eventually decided that all-cluster encryption was the easiest to implement in the first version. Later releases will build on this.

Fundamentally, TDE must meet three criteria — it must be secure, obviously, but it also must be done in a way that has minimal impact on the rest of the Postgres code. This has value for two reasons — first, only a small number of users will use TDE, so the less code that is added, the less testing is required. Second, the less code that is added, the less likely TDE will break because of future Postgres changes. Finally, TDE should meet regulatory requirements. This diagram by Peter Smith illustrates the constraints.

Continue Reading »

Robert Treat: Introducing phpPgAdmin 7.12.0


After an overly long development cycle, I'm pleased to introduce the latest release of phpPgAdmin, version 7.12.0.

As with many software releases, the code changes are plenty, and the release bullets are few, but they are quite important. In this release we have:

  • PHP 7 is now the default version for development, and the minimum version required for phpPgAdmin going forward. Most users are currently running PHP 7, so we're happy to support this going forward, and encourage users of PHP 5.x to upgrade for continued support.

  • We've added support for all current versions of PostgreSQL, including the pending PostgreSQL 12 release. Our aim going forward will be to ensure that we are properly supporting all current releases of Postgres, with degraded support for EOL versions.

  • We've updated some internal libraries, fixed additional bugs, and merged many patches that had accumulated over the years. We want to thank everyone who provided a patch, whether merged or not, and hope you will consider contributing to phpPgAdmin in the future.

This version also comes with a change to our development and release cycle process. When the project originally started, we developed and released new versions like traditional desktop software; annual-ish releases for new versions with all the new features, while providing a few periodic bugfix releases in between. While this was ok from a developer's point of view, that meant users had to wait for months (and in unfortunate cases, years) between releases to get new code. As developers, we never felt that pain, because developers would just run code directly from git master. As it turns out, that is a much better experience, and as much of the software world has changed to embrace that idea, our process is going to change as well.

The first part of this is changing how we number our releases. Going forward, our versions numbers will represent:

- the primary PHP version supported (7), 
- the most recent version of PostgreSQL supported (12), 
- and the particular release number in that series (0).   

Our plan is to continue developing on this branch (7_12) and releasing new features and bug fixes as often as needed. At some point about a year from now, after PostgreSQL has branched for Postgres 13/14, we'll incorporate that into an official release, and bump our release number to 7.13.0. Presumably, in a few years, there will eventually be a release of PHP 8, and we'll start planning that change at that time. We hope this will make it easier for both users and contributors going forward.

For more information on phpPgAdmin, check out our project page at https://github.com/phppgadmin/phppgadmin/. You can download the release at: https://github.com/phppgadmin/phppgadmin/releases/tag/REL_7-12-0

Once again, I want to thank everyone who has helped contribute to phpPgAdmin over the years. The project has gone through some ups and downs, but despite that is still used by a very large number of users and it enjoys a healthy developer ecosystem. We hope you find this new release helpful!

Regina Obe: PostGIS 3.0.0beta1


The PostGIS development team is pleased to release PostGIS 3.0.0beta1.

This release works with PostgreSQL 9.5-12RC1 and GEOS >= 3.6

Best served with PostgreSQL 12RC1 and GEOS 3.8.0beta1, both of which came out in the past couple of days.

Continue Reading by clicking title hyperlink ..