Channel: Planet PostgreSQL

Alexey Lesovsky: Let’s speed things up.


Parallelism settings and how to accelerate your database performance.

A few days ago, while performing a routine audit for one of our clients, we noticed that parallel queries were disabled in their database. That by itself wouldn’t be surprising, but this client runs powerful servers that handle more than just lightweight OLTP queries. When we brought it to the client’s attention, his first question was "How can we enable it?", reasonably followed by "Will there be any negative consequences if we do?"

This kind of question pops up more often than one might think, hence this post, in which we look into parallel queries in detail.



Parallel queries were introduced two releases ago, and some significant changes were made in Postgres 10, so the configuration procedure in 9.6 and 10 differs slightly. The upcoming Postgres 11 also adds several new features related to parallelism.

The parallel queries feature was introduced in Postgres 9.6 and is available for a limited set of operations such as Seq Scan, Aggregate and Join. In Postgres 10 the list of supported operations was broadened to include Bitmap Heap Scan, Index Scan and Index-Only Scan.
Postgres 11 (hopefully!) will be released with support for parallel index creation and for parallel execution of commands that create tables, such as CREATE TABLE AS, SELECT INTO and CREATE MATERIALIZED VIEW.

When parallelism was introduced in 9.6 it was disabled by default, but in Postgres 10 it is enabled out of the box - our client’s database runs on 9.6, which explains why it was disabled. Depending on the Postgres version there are two or three parameters that enable parallel queries; they are located in postgresql.conf in the "Asynchronous Behavior" section.

The first one is max_worker_processes, added in Postgres 9.4, which sets the limit on background worker processes that Postgres can run (the default is 8). It covers background workers but not system background processes such as the checkpointer, bgwriter, WAL senders/receivers, etc. Note that changing this parameter requires a Postgres restart.

The second parameter is max_parallel_workers_per_gather, which defaults to 0 in Postgres 9.6, meaning that the parallel queries feature is disabled; to enable it, set it to a value greater than 0. In Postgres 10, the default was changed to 2. This parameter defines the maximum number of workers allowed per single parallel query.

The third parameter is max_parallel_workers, added in Postgres 10. It defines the maximum number of workers used for parallel queries only, since max_worker_processes also covers other background workers.

Overall, these three parameters define the general limit of workers and the limit of workers used for parallel queries. What values should be used? That depends on the number of CPU cores and the capacity of the storage system. Parallel queries obviously may consume more resources than non-parallel ones: CPU time, memory, IO and so on. Imagine that Postgres launches a high number of parallel queries and the parallel worker pool is completely exhausted; in that case the system must still have free cores and storage throughput to run non-parallel queries without a performance slowdown. For example, for a system with 32 CPU cores and SSDs, a good starting point is:
  • max_worker_processes = 12
  • max_parallel_workers_per_gather = 4
  • max_parallel_workers = 12
These settings allow at least 3 parallel queries to run concurrently with a maximum of 4 workers per query, leaving 20 cores for other, non-parallel queries. If background workers are in use, max_worker_processes should be increased accordingly.
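For reference, here is a minimal sketch of how such starting values could be applied with ALTER SYSTEM; the numbers simply mirror the suggestions above and should be adapted to your own hardware and workload:

-- Illustrative starting values for a 32-core server with SSDs; adjust to your workload.
ALTER SYSTEM SET max_worker_processes = 12;             -- takes effect only after a restart
ALTER SYSTEM SET max_parallel_workers_per_gather = 4;
ALTER SYSTEM SET max_parallel_workers = 12;             -- Postgres 10 and later only
SELECT pg_reload_conf();                                -- reloads the reloadable settings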

Those are, of course, not the only parameters the planner takes into consideration when deciding between parallel and non-parallel plans. Additional advanced settings can be found in the "Planner Cost Constants" section of postgresql.conf.

parallel_tuple_cost and parallel_setup_cost define the extra costs added to a query's total cost when parallelism is used. Another one is min_parallel_relation_size, only available in Postgres 9.6, which defines the minimal size of relations that can be scanned in parallel. In Postgres 10 indexes can also be scanned in parallel, so this parameter was split into min_parallel_table_scan_size and min_parallel_index_scan_size, which do the same for tables and indexes respectively.
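To see whether the planner actually chooses a parallel plan for a given query once these settings are in place, EXPLAIN is enough: a parallel plan contains a Gather node and a "Workers Planned" line. In this sketch, big_table is just a placeholder for a sufficiently large table:

-- A parallel plan shows a Gather node plus "Workers Planned: N";
-- big_table is a hypothetical, sufficiently large table.
EXPLAIN (ANALYZE)
SELECT count(*) FROM big_table;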

In general, the cost parameters can be left as they are. They might need a bit of adjustment in rare cases when tuning particular queries.
I'd also like to mention that Postgres 11 will introduce parallel index creation along with an additional parameter, max_parallel_maintenance_workers, which defines the maximum number of workers used by the CREATE INDEX command.

Do you use parallelism? Did you have any issues using it? Would be great to hear your thoughts!

Sehrope Sarkuni: "ANSI? Schmansi!" at PostgresConf 2018


I gave a talk last week at PostgresConf US 2018 titled "ANSI, Schmansi! How I learned to stop worrying and love Postgres-isms". The slides for the talk can be viewed online here:

https://sehrope.github.io/postgres-conf-2018-ansi-schmansi/#/

This talk is about using the full potential of your choice of database with concrete examples of Postgres-specific features. The audience was wonderful and I look forward to more feedback on the topic.

The slides were built with a customized reveal.js template for writing in pug. This allows most of the example SQL to be housed in separate files (in the sql/ directory) and referenced from the templates. The source for the slides, including the example SQL, is available here.

Do you think embracing database specific behavior is a great idea? Let me know!

Dimitri Fontaine: PostgreSQL Data Types: XML


Continuing our series of PostgreSQL Data Types today we’re going to introduce the PostgreSQL XML type.

The SQL standard includes SQL/XML, which introduces the predefined data type XML together with constructors, several routines, functions, and XML-to-SQL data type mappings to support manipulation and storage of XML in a SQL database, as per the Wikipedia page.
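As a quick, illustrative taste of the type (my own sketch, not taken from the article): a literal can be cast to xml and queried with the built-in xpath() function.

-- Cast a literal to xml and extract the title with xpath(); the document is made up.
SELECT xpath('/book/title/text()',
             '<book><title>PostgreSQL Data Types</title></book>'::xml);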

Hubert 'depesz' Lubaczewski: Waiting for PostgreSQL 11 – Add json(b)_to_tsvector function

On 7th of April 2018, Teodor Sigaev committed patch: Add json(b)_to_tsvector function Jsonb has a complex nature so there isn't best-for-everything way to convert it to tsvector for full text search. Current to_tsvector(json(b)) suggests to convert only string values, but it's possible to index keys, numerics and even booleans value. To solve that json(b)_to_tsvector has […]
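As a rough illustration of the committed function (my own sketch, not part of the commit message): the third argument filters which JSON value types end up in the resulting tsvector.

-- Index string and numeric values of a jsonb document (PostgreSQL 11+).
SELECT jsonb_to_tsvector('english',
                         '{"title": "Fat rats", "qty": 123, "done": true}',
                         '["string", "numeric"]');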

Tatsuo Ishii: More load balancing fine control

Pgpool-II is known for providing load balancing of read queries across multiple PostgreSQL servers.

Consider a cluster consisting of Pgpool-II and two PostgreSQL servers configured with streaming replication. One is a primary server, the other is a standby server. Also, the load balancing weight is 0:1, which means read queries will basically be sent to the standby server.

In an explicit transaction, if a read query is sent from a client, Pgpool-II sends it to the standby server (as long as the replication delay is not too large). Next, a write query comes in; of course it is sent to the primary server. An interesting thing happens if another read query follows: unlike the previous read query, it is sent to the primary to avoid the effects of replication delay.


Pgpool-II 4.0 will have a new configuration parameter called "disable_load_balance_on_write" to give finer control over the behavior above.

If disable_load_balance_on_write = 'transaction', the behavior is exactly the same as above. This is the default.

If disable_load_balance_on_write = 'off', Pgpool-II no longer takes the write query into account and load balances read queries as much as possible. This choice is good if you want to get the maximum benefit from load balancing.

A variant of 'transaction' is 'trans_transaction'. Unlike 'transaction', the effect of disabling load balancing after a write query continues even after the transaction closes and a new transaction begins.

So 'trans_transaction' is good for those who value data consistency over load balancing. Please note that a read query issued between explicit transactions is not affected by the parameter.

Finally there's 'always'. If this mode is specified, then once a write query is sent, any read query, regardless of whether it is inside an explicit transaction or not, will no longer be load balanced and will be sent to the primary server until the session ends.

In summary, the degree of data-consistency protection is, in decreasing order: always, trans_transaction, transaction and off. In contrast, the degree of performance is, in decreasing order: off, transaction, trans_transaction and always. If you are not sure which one is the best choice for you, I recommend starting with 'transaction' because it strikes a good balance between data consistency and performance.

damien clochard: The Open Decision Framework


Tomorrow I’ll talk about the Open Decision Framework at the TEQNation conference in The Netherlands and I’ll try to explain how we can use open source principles to make better decisions.

Free Software is a powerful movement with many different types of governance models and a lot of diversity. But all the successful open source projects have one thing in common: over the years they have developed efficient decision workflows to solve their problems.

Making decisions using open source principles

Dalibo is a small Postgres company, but as we grow, we face more complex issues and we want to maintain a governance model based on transparency and open thinking.

Last year, we learned about a method called the «Open Decision Framework», developed by Red Hat. We translated it into French and then we “forked” it to fit our organisation.

So far this method has worked really well for us and it produced solutions to some big questions that were left unanswered for years. I’ve been so impressed by these results that I wanted to share our experience and show that open source principles can apply beyond writing code.

If you want to learn more about it, here’s a shorter version of my talk recorded during FOSDEM in February:

PS: The video is a bit shaky and my English is rusty, but I hope the talk will make you want to learn more about the Open Decision Framework! :)

Dimitri Fontaine: PostgreSQL Data Types: Date and Time Processing


Continuing our series of PostgreSQL Data Types today we’re going to introduce date and time based processing functions.

Once the application’s data, or rather the user data, is properly stored as timestamp with time zone, PostgreSQL allows you to implement all the processing you need. In this article we dive into a set of examples to help you get started with time-based processing in your database. Can we boost your reporting skills?
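As a small appetizer of the kind of processing the article covers (the events table here is hypothetical):

-- Daily event counts for the last week from a hypothetical
-- events(created_at timestamptz) table.
SELECT date_trunc('day', created_at) AS day,
       count(*)
  FROM events
 WHERE created_at >= now() - interval '7 days'
 GROUP BY 1
 ORDER BY 1;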

Markus Winand: One Giant Leap For SQL: MySQL 8.0 Released



“Still using SQL-92?” is the opening question of my “Modern SQL” presentation. When I ask this question, an astonishingly large portion of the audience openly admits to using 25-year-old technology. If I ask who is still using Windows 3.1, which was also released in 1992, only a few raise their hand…but they’re joking, of course.

Clearly this comparison is not entirely fair. It nevertheless demonstrates that the know-how surrounding newer SQL standards is pretty lacking. There were actually five updates since SQL-92—many developers have never heard of them. The latest version is SQL:2016.

As a consequence, many developers don’t know that SQL hasn’t been limited to the relational algebra or the relational model since 1999. SQL:1999 introduced operations that don't exist in relational algebra (with recursive, lateral) and types (arrays!) that break the traditional interpretation of the first normal form.

Since then, so for 19 years, whether or not a SQL feature fits the relational idea isn’t important anymore. What is important is that a feature has well-defined semantics and solves a real problem. The academic approach has given way to a pragmatic one. Today, the SQL standard has a practical solution for almost every data processing problem. Some of them stay within the relational domain, while others do not.


Resolution

Don’t say relational database when referring to SQL databases. SQL is really more than just relational.


It’s really too bad that many developers still use SQL in the same way it was being used 25 years ago. I believe the main reasons are a lack of knowledge and interest among developers along with poor support for modern SQL in database products.

Let’s have a look at this argument in the context of MySQL. Considering its market share, I think that MySQL’s lack of modern SQL has contributed more than its fair share to this unfortunate situation. I once touched on that argument in my 2013 blog post “MySQL is as Bad for SQL as MongoDB is to NoSQL”. The key message was that “MongoDB is a popular, yet poor representative of its species—just like MySQL is”. Joe Celko has expressed his opinion about MySQL differently: “MySQL is not SQL, it merely borrows the keywords from SQL”.

You can see some examples of the questionable interpretation of SQL in the MySQL WAT talk on YouTube. Note that this video is from 2012 and uses MySQL 5.5 (the current GA version at that time). Since then, MySQL 5.6 and 5.7 came out, which improved the situation substantially. The default settings on a fresh installation are much better now.

It is particularly nice that they were really thinking about how to mitigate the effects of changing defaults. When they enabled ONLY_FULL_GROUP_BY by default, for example, they went the extra mile to implement the most complete functional dependencies checking among the major SQL databases:

Availability of Functional Dependencies

About the same time MySQL 5.7 was released, I stopped bashing MySQL. Of course I'm kidding. I'm still bashing MySQL occasionally…but it has become a bit harder since then.

By the way, did you know MySQL still doesn’t support check constraints? Just as in previous versions, you can use check constraints in the create table statement but they are silently ignored. Yes—ignored without warning. Even MariaDB fixed that a year ago.

Availability of CHECK constraints
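To make the point concrete, here is a small, hypothetical example of the behaviour described above (as of the MySQL versions discussed here): the CHECK clause is accepted by the parser and then silently dropped.

-- MySQL parses the CHECK clause but ignores it, so this insert succeeds
-- even though it violates the declared constraint.
CREATE TABLE products (price INT CHECK (price > 0));
INSERT INTO products VALUES (-1);   -- no error, no warning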

Uhm, I’m bashing again! Sorry—old habits die hard.

Nevertheless, the development philosophy of MySQL has visibly changed over the last few releases. What happened? You know the answer already: MySQL has been under new management since Oracle bought it through Sun. I must admit: it might have been the best thing that happened to SQL in the past 10 years, and I really mean SQL—not MySQL.

The reason I think a single database release has a dramatic effect on the entire SQL ecosystem is simple: MySQL is the weakest link in the chain. If you strengthen that link, the entire chain becomes stronger. Let me elaborate.

MySQL is very popular. According to db-engines.com, it’s the second most popular SQL database overall. More importantly: it is, by a huge margin, the most popular free SQL database. This has a big effect on anyone who has to cope with more than one specific SQL database. These are often software vendors that make products like content management systems (CMSs), e-commerce software, or object-relational mappers (ORMs). Due to its immense popularity, such vendors often need to support MySQL. Only a few of them bite the bullet and truly support multiple databases—Java Object Oriented Querying (jOOQ) really stands out in this regard. Many vendors just limit themselves to the commonly supported SQL dialect, i.e. MySQL.

Another important group affected by MySQL’s omnipresence are people learning SQL. They can reasonably assume that the most popular free SQL database is a good foundation for learning. What they don't know is that MySQL limits their SQL-foo to the weakest SQL dialect among those being widely used. Based loosely on Joe Celko’s statement: these people know the keywords, but don’t understand their real meaning. Worse still, they have not heard anything about modern SQL features.

Last week, that all changed when Oracle finally published a generally available (GA) release of MySQL 8.0. This is a landmark release as MySQL eventually evolved beyond SQL-92 and the purely relational dogma. Among a few other standard SQL features, MySQL now supports window functions (over) and common table expressions (with). Without a doubt, these are the two most important post-SQL-92 features.

The days are numbered in which software vendors claim they cannot use these features because MySQL doesn't support them. Window functions and CTEs are now in the documentation of the most popular free SQL database. Let me therefore boldly claim: MySQL 8.0 is one small step for a database, one giant leap for SQL.

It gets even better and the future is bright! As a consequence of Oracle getting its hands on MySQL, some of the original MySQL team (among them the original creator) created the MySQL fork MariaDB. Apparently, their strategy is to add many new features to convince MySQL users to consider their competing product. Personally I think they sacrifice quality—very much like they did before with MySQL—but that’s another story. Here it is more relevant that MariaDB has been validating check constraints for a year now. That raises a question: how much longer can MySQL afford to ignore check constraints? Or to put it another way, how much longer can they endure my bashing ;)

Besides check constraints, MariaDB 10.2 also introduced window functions and common table expressions (CTEs). At that time, MySQL had a beta with CTEs but no window functions. MariaDB is moving faster.

In 10.3, MariaDB is set to release “system versioned tables”. In a nutshell: once activated for a table, system versioning keeps old versions of updated and deleted rows. By default, queries return the current version as usual, but you can use a special syntax (as of) to get older versions. You can read more about this in MariaDB’s announcement.

System versioning was introduced into the SQL standard in 2011. As it looks now, MariaDB will be the first free SQL database to support it. I hope this is an incentive for other vendors—and also for users asking their vendors to support more modern SQL features!

Now that the adoption of modern SQL has finally gained some traction, there is only one problem left: the gory details. The features defined by the standard have many subfeatures, and due to their sheer number, it is common practice to support only some of them. That means it is not enough to say that a database supports window functions. Which window functions does it actually support? Which frame units (rows, range, groups)? The answers to these questions make all the difference between a marketing gag and a powerful feature.

In my mission to make modern SQL more accessible to developers, I’m testing these details so I can highlight the differences between products. The results of these tests are shown in matrices like the ones above. The rest of this article will thus briefly go through the new standard SQL features introduced with MySQL 8.0 and discuss some implementation differences. As you will see, MySQL 8.0 is pretty good in this regard. The notable exception is its JSON functionality.

Window Functions

There is SQL before window functions and SQL after window functions. Without exaggeration, window functions are a game changer. Once you understand window functions, you cannot imagine how you ever lived without them. The most common use cases, for example finding the best N rows per group, building running totals or moving averages, and grouping consecutive events, are just the tip of the iceberg. Window functions are one of the most important tools for avoiding self-joins. That alone makes many queries less redundant and much faster. Window functions are so powerful that even newcomers like several Apache SQL implementations (Hive, Impala, Spark), NuoDB and Google BigQuery introduced them years ago. It’s really fair to say that MySQL is pretty late to this party.
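For readers new to the topic, here is a minimal, illustrative example (the transactions table is made up) showing how a window function replaces a self-join for a running total:

-- Running total per account without a self-join; transactions is hypothetical.
SELECT account_id,
       booked_at,
       amount,
       SUM(amount) OVER (PARTITION BY account_id
                         ORDER BY booked_at) AS running_total
  FROM transactions;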

The following matrix shows the support of the over clause for some major SQL databases. As you can see, MySQL’s implementation actually exceeds the capabilities of “the world’s most advanced open source relational database”, as PostgreSQL claims on its new homepage. However, PostgreSQL 11 is set to recapture the leader position in this area.

Availability of OVER

The actual set of window functions offered by MySQL 8.0 is also pretty close to the state of the art:

Availability of Window-Functions

Common Table Expressions (with [recursive])

The next major enhancement in MySQL 8.0 is common table expressions, or the with [recursive] clause. Important use cases are traversing graphs with a single query, generating an arbitrary number of rows, converting CSV strings to rows (a reversed listagg / group_concat), or simply literate SQL.
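A small, hypothetical illustration of the graph-traversal use case (an employees table with a self-referencing manager_id column):

-- All direct and indirect reports of employee 1, via WITH RECURSIVE.
WITH RECURSIVE reports AS (
    SELECT id, name, manager_id
      FROM employees
     WHERE id = 1
    UNION ALL
    SELECT e.id, e.name, e.manager_id
      FROM employees e
      JOIN reports r ON e.manager_id = r.id
)
SELECT * FROM reports;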

Again, MySQL’s first implementation closes the gap.

Availability of WITH

Other Standard SQL Features

Besides window functions and the with clause, MySQL 8.0 also introduces some other standard SQL features. However, compared to the previous two, these are by no means killer features.

Other new standard SQL features in MySQL 8.0

As you can see, Oracle pushes standard SQL JSON support. The Oracle database and MySQL are currently the leaders in this area (and both are from the same vendor!). The json_objectagg and json_arrayagg functions were even backported to MySQL 5.7.22. However, it’s also notable that MySQL doesn’t follow the standard syntax for these two functions. Modifiers defined in the standard (e.g. an order by clause) are generally not supported. Json_objectagg neither recognizes the keywords key and value nor accepts the colon (:) to separate attribute names and values. It looks like MySQL parses these as regular function calls—as opposed to the syntax described by the standard.
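To illustrate the syntax difference (products here is a hypothetical table): the standard spells out KEY and VALUE, while MySQL 8.0 expects plain function arguments.

-- Standard SQL syntax (not accepted by MySQL 8.0, as described above):
SELECT json_objectagg(KEY name VALUE price) FROM products;

-- MySQL 8.0 syntax: a regular two-argument function call.
SELECT json_objectagg(name, price) FROM products;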

It’s also interesting to see that json_arrayagg handles null values incorrectly, very much like the Oracle database (they don’t default to absent on null). Seeing the same issue in two supposedly unrelated products is always interesting. Adding the fact that both products come from the same vendor adds another twist.

The two last features in the list, grouping function (related to rollup) and column names in the from clause are solutions to pretty specific problems. Their MySQL 8.0 implementation is basically on par with that of other databases.

Furthermore, MySQL 8.0 also introduced standard SQL roles. The reason this is not listed in the matrix above is simple: the matrices are based on actual tests I run against all these databases. My homegrown testing framework does not yet support test cases that require multiple users—currently all tests are run with a default user, so I cannot test access rights yet. However, the time for that will come—stay tuned.

Other Notable Enhancements

I'd like to close this article with MySQL 8.0 fixes and improvements that are not related to the SQL standard.

One of them is about using the desc modifier in index declarations:

CREATE INDEX … ON … (<column> [ASC|DESC], …)

Most—if not all—databases use the same logic for index creation as for the order by clause, i.e. by default, the order of column values is ascending. Sometimes you need to sort some index columns in the opposite direction. That’s when you specify desc in an index. Here’s what the MySQL 5.7 documentation said about this:

An index_col_name specification can end with ASC or DESC. These keywords are permitted for future extensions for specifying ascending or descending index value storage. Currently, they are parsed but ignored; index values are always stored in ascending order.

“They are parsed but ignored”? To be more specific: they are parsed but ignored without warning, very much like the check constraints mentioned above.

However, this has been fixed with MySQL 8.0. Now there is a warning. Just kidding! Desc is honored now.
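A short, hypothetical example of what now works as expected:

-- With MySQL 8.0 the DESC keyword is honored, so score is stored in
-- descending order instead of the keyword being silently ignored.
CREATE INDEX idx_scores ON scores (player_id ASC, score DESC);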

There are many other improvements in MySQL 8.0. Please refer to “What’s New in MySQL 8.0?” for a great overview. How about a small appetizer:

“One Giant Leap For SQL: MySQL 8.0 Released” by Markus Winand was originally published at modern SQL.


Pavel Stehule: pspg 1.1 with readline support

I released a new version of pspg. This version integrates the readline library - editing of search strings is now a little more comfortable and, above all, the search history is stored in a persistent history file managed by readline.

Hans-Juergen Schoenig: ora_migrator: Moving from Oracle to PostgreSQL even faster


As some of you might know, Cybertec has been doing PostgreSQL consulting, tuning and 24×7 support for many years now. However, one should not only watch what is going on in the PostgreSQL market. It also makes sense to look left and right to figure out what the rest of the world is up to these days. I tend to read a lot about Oracle, the cloud, their new strategy and all that on the Internet these days.

Oracle, PostgreSQL and the cloud

What I found in an Oracle blog post seems to sum up what is going on: “This is a huge change in Oracle’s business model, and the company is taking the transformation very seriously”. The trouble is: Oracle’s business model might change a lot more than they think, because these days more and more people are actually moving to PostgreSQL. So yes, the quote is correct – it is a “huge change” in their business model, but maybe not the one they intended.

As license fees and support costs seem to be ever increasing for Oracle customers, more and more people step back and reflect. The logical consequence is: people are moving to PostgreSQL in ever greater numbers. So why not give people a tool to move to PostgreSQL as fast as possible? In many cases an Oracle database is only used as a simple data store. The majority of systems only contains tables, constraints, indexes, foreign keys and so on. Sure, many databases will also contain procedures and more sophisticated stuff. However, I have seen countless systems which are simply trivial.

Migrating to PostgreSQL

While there are tools such as ora2pg out there that help people move from Oracle to PostgreSQL, I found them in general a bit cumbersome and not as efficient as they could be.
So why not build a migration tool that makes migration as simple as possible? Why not migrate a simple database with a single SQL statement? We want to HURT Oracle sales people after all ;).

ora_migrator is doing exactly that:

CREATE EXTENSION ora_migrator;
SELECT oracle_migrate(server => 'oracle', only_schemas => '{HANS,PAUL}');

Unless you have stored procedures in PL/SQL or some other hyper-fancy stuff, that is already it. The entire migration will be done in a SINGLE transaction.

How does it work?

Oracle to PostgreSQL migration

First of all, the ora_migrator will connect to Oracle using oracle_fdw, which is the real foundation of the software. Then we read the Oracle system catalog and store a copy of the table definitions, index definitions and so on in PostgreSQL. oracle_fdw does all the data type mapping for us. Why do we copy the Oracle system catalog to local tables instead of using it directly? During the migration process you might want to make changes to the underlying data structures (a commercial GUI to do that is available). You might not want to copy table definitions and the like blindly.
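For context, this is roughly what the underlying oracle_fdw setup could look like before calling oracle_migrate(); the server name "oracle" matches the example above, while the connection details are placeholders:

-- Hypothetical oracle_fdw setup; adjust dbserver, user and password.
CREATE EXTENSION oracle_fdw;
CREATE SERVER oracle FOREIGN DATA WRAPPER oracle_fdw
    OPTIONS (dbserver '//oracle.example.com:1521/ORCL');
CREATE USER MAPPING FOR postgres SERVER oracle
    OPTIONS (user 'HANS', password 'secret');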

Once the definitions are duplicated and once you have made your modifications (which might not be the case too often – most people prefer a 1:1 copy), the ora_migrator will actually create the desired tables in PostgreSQL, load the data from Oracle, create indexes and add constraints. Your transaction will commit and, voila, your migration is done.

One word about Oracle stored procedures

Experience has shown that a fully automated migration process for stored procedures usually fails. We have used ora2pg for a long time, and procedure conversion has always been a pain when attempted automatically, so we decided to skip this part entirely. To avoid nasty bugs, bad performance or simple mistakes, we think it makes more sense to port stored procedures manually. In the past, automatic conversion has led to a couple of subtle bugs (mostly NULL handling issues), and therefore we do not attempt it in the first place. In real life this is not an issue and does not add much additional work to the migration process – the small amount of extra time is worth spending on quality, in my judgement.

ora_migrator is free

If you want to move from Oracle to PostgreSQL, you can download ora_migrator for free from our github page. We have also created a GUI for the ora_migrator, which will be available along with our support and consulting services.

More infos on ora_migrator can be found here: https://www.cybertec-postgresql.com/en/products/cybertec-enterprise-migrator-modern-oracle-postgresql-migration/

The post ora_migrator: Moving from Oracle to PostgreSQL even faster appeared first on Cybertec.

Hubert 'depesz' Lubaczewski: Waiting for PostgreSQL 11 – Indexes with INCLUDE columns and their support in B-tree

On 7th of April 2018, Teodor Sigaev committed patch: Indexes with INCLUDE columns and their support in B-tree This patch introduces INCLUDE clause to index definition. This clause specifies a list of columns which will be included as a non-key part in the index. The INCLUDE columns exist solely to allow more queries to benefit […]
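A minimal, illustrative example of the new clause (table and columns are made up): id is the only key column, while the INCLUDE columns are carried in the index purely to enable index-only scans.

-- PostgreSQL 11: non-key payload columns in a B-tree index.
CREATE INDEX orders_id_idx ON orders (id) INCLUDE (customer_id, total);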

Jan Karremans: A week of PostgreSQL


One of the attractive things about my job is this… Just a bit more often than every now and then, you get the opportunity to get out and meet people to talk about Postgres. I don’t mean the kind of talk I do every day, which has more of a commercial touch to it. – Don’t get me wrong, that is very important too! – But I mean really talking about PostgreSQL, being part of the community and helping spread the understanding of what open source database technology can do for companies. Running implementations, either small or large, trivial or mission critical…

This past week was one of those weeks.

I got to travel through Germany together with Mr. Bruce Momjian himself. Bruce is one of the most established and senior community leaders for Postgres. Bruce is also my colleague, and I would like to think I may consider him my friend. My employer, EnterpriseDB, gives us the opportunity to do this: to be an integral part of the PostgreSQL community, contribute, help expand the fame of Postgres, no strings attached, and support the success of the 30 to 40,000 engineers creating this most advanced open source RDBMS.

The week started with travel, and I got to Frankfurt. Frankfurt would be the proving ground for the idea of a pop-up meet-up: not an EDB marketing event or somewhere we sell EnterpriseDB services, but a place where anyone can just discuss PostgreSQL.
We would be in a city, in a public place, answering questions, discussing things or just relaxing with some coffee. The purpose is to show what the PostgreSQL community is all about, to anyone interested!

The first day in Frankfurt, we spent at the 25hrs hotel. We had some very interesting discussions on:

  • Postgres vs. Oracle community
  • Changing role of DBA:
    • The demise of the Oracle DBA
    • RDBMS DBA not so much
  • Risk management
  • “Data scientist”
  • Significance of relational growing again

In the afternoon we took the train to Munich, which was a quick and smooth experience. Munich would be the staging ground for a breakfast meeting, or a lunch… or just saying hi.

Bruce and I spent the day discussing:

  • How to go from using Postgres as replacement of peripheral Oracle to Postgres as replacement for all Oracle
  • Using Postgres as polyglot data platform bringing new opportunities

After the meet-up we headed to Berlin by train, towards the final two events of the week. We spent Thursday teaching the EDB Postgres Bootcamp, having a lot of fun and absolutely not sticking to the program. With Bruce there, and very interesting questions from the participants, we were able to talk about the past and the future of Postgres and all the awesome stuff that is just around the corner.
Friday morning started with a brisk taxi-drive from Berlin to the Müggelsee Hotel. And, if you happen to talk to Bruce, you simply must ask him about this taxi-trip 😉

pgconf.de ended up being a superb event with a record-breaking number of visitors and lots of interesting conversations. You will find loads of impressions here!

I got to meet a great number of the specialists that make up the Postgres community:
Andreas ‘Ads’ Scherbaum
Devrim Gündüz
Magnus Hagander
Emre Hasegeli
Oleksii Kliukin
Stefanie Stölting
Ilya Kosmodemiansky
Valentine Gogichashvili

I am already looking forward to the next Postgres events I get to attend… pgconf.de 2019 will in any case happen on the 10th of May in Leipzig.
It would be super cool to see you there, please submit your abstracts using the information from this page!


The post A week of PostgreSQL appeared first on Johnnyq72 and was originally written by Johnnyq72.

Michael Paquier: Postgres 11 highlight - Group access on data folder


The following commit introduces a new feature in PostgreSQL 11 that makes it possible to slightly relax the permissions required on data folders:

commit: c37b3d08ca6873f9d4eaf24c72a90a550970cbb8
author: Stephen Frost <sfrost@snowman.net>
date: Sat, 7 Apr 2018 17:45:39 -0400
Allow group access on PGDATA

Allow the cluster to be optionally init'd with read access for the group.

This means a relatively non-privileged user can perform a backup of the
cluster without requiring write privileges, which enhances security.

The mode of PGDATA is used to determine whether group permissions are
enabled for directory and file creates.  This method was chosen as it's
simple and works well for the various utilities that write into PGDATA.

Changing the mode of PGDATA manually will not automatically change the
mode of all the files contained therein.  If the user would like to
enable group access on an existing cluster then changing the mode of all
the existing files will be required.  Note that pg_upgrade will
automatically change the mode of all migrated files if the new cluster
is init'd with the -g option.

Tests are included for the backend and all the utilities which operate
on the PG data directory to ensure that the correct mode is set based on
the data directory permissions.

Author: David Steele <david@pgmasters.net>
Reviewed-By: Michael Paquier, with discussion amongst many others.
Discussion: https://postgr.es/m/ad346fe6-b23e-59f1-ecb7-0e08390ad629%40pgmasters.net

Group access on the data folder means that files can optionally use 0640 as their mask and folders 0750, which becomes handy particularly for backup scenarios where a user different from the one running PostgreSQL is enough to take a backup of the instance. For some security policies it is important to perform an operation with a user holding the minimum set of permissions needed for the task; in this case a user who is a member of the same group as the one running the PostgreSQL instance is able to read all files in the data folder and take a backup from it. So not only is this useful for people implementing their own backup tool, but also for administrators who want the backup task to run with only a minimal set of access permissions.

The feature can be enabled using initdb -g/--allow-group-access, which creates files with 0640 as mask and folders with 0750. Note that in v10 and older versions, trying to start a server whose base data folder has permissions different from 0700 results in a failure of the postmaster process, while with v11 and above the postmaster can start if the data folder uses either 0700 or 0750. An administrator can also perfectly well initialize a data folder without --allow-group-access first and switch it to group permissions afterwards with chmod -R or similar, and the cluster adapts automatically. To know whether a data folder uses group access, a new GUC parameter called data_directory_mode is available, which returns the mask in use, so for a data folder allowing group access you would see this:

=# SHOW data_directory_mode;
 data_directory_mode
---------------------
 0750
(1 row)

Sometimes deployments of PostgreSQL use advanced backup strategies mixing multiple solutions, which is why the following in-core tools also respect whether group access is allowed on a cluster when fetching and writing files related to it:

  • pg_basebackup, which respects permissions for both the tar and plain formats.
  • pg_receivewal will create new WAL segments using group permissions.
  • pg_recvlogical does the same for logical changes received.
  • pg_rewind.
  • pg_resetwal.

Note that it is not possible to override the mask used: if a cluster has group access enabled, all the tools mentioned above automatically switch to it. It is not possible to write data with group access when the data folder does not use it, nor to write data without group access when the data folder does use it. So all the behaviors are kept consistent for simplicity.

For developers of tools and plugins writing data into a data folder or anything related to PostgreSQL, there is a simple way to track whether group access is enabled on an instance. First, over a normal libpq connection it is possible to check data_directory_mode using the SHOW command (which works with the replication protocol as well!). For tools working directly on a data folder, like pg_rewind or pg_resetwal, a new API called GetDataDirectoryCreatePerm() is available, which sets a couple of low-level variables defining the mask needed for files and folders automatically:

  • pg_mode_mask for the mode mask, usable with umask().
  • pg_file_create_mode, for file creation mask.
  • pg_dir_create_mode, for directory creation mask.

So you may want to patch your tool so that it handles permissions in a way consistent with PostgreSQL 11 and newer versions.

One last thing. Be careful with SSL certificates and the like in the data folder when allowing group access, as they could cause errors for the software doing the backup. Fortunately those can be located outside the data folder.

Dimitri Fontaine: PostgreSQL Data Types: JSON


Continuing our series of PostgreSQL Data Types today we’re going to introduce the PostgreSQL JSON type.

PostgreSQL has built-in support for JSON with a great range of processing functions and operators, and complete indexing support. The documentation covers all the details in the chapters entitled JSON Types and JSON Functions and Operators.
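As a tiny, illustrative taste of the type before diving into the article:

-- Store a jsonb literal and navigate it with -> (returns jsonb)
-- and ->> (returns text).
SELECT doc -> 'customer' ->> 'name' AS customer_name
  FROM (SELECT '{"customer": {"name": "Alice"}}'::jsonb AS doc) AS s;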

Hans-Juergen Schoenig: Tech preview: PostgreSQL 11 – CREATE PROCEDURE


Many people have asked for this feature for years and PostgreSQL 11 will finally have it. I am of course talking about CREATE PROCEDURE. Traditionally PostgreSQL has provided all the means to write functions (which were often simply called “stored procedures”). However, in a function you cannot really run transactions – all you can do is use exceptions, which are basically savepoints. Inside a function you cannot just commit a transaction or open a new one. CREATE PROCEDURE will change all that and provide you with the means to run transactions in procedural code.

Using CREATE PROCEDURE in PostgreSQL

CREATE PROCEDURE will allow you to write procedures just like in most other modern databases. The syntax is quite simple and definitely not hard to use:

db11=# \h CREATE PROCEDURE
Command: CREATE PROCEDURE
Description: define a new procedure
Syntax:
CREATE [ OR REPLACE ] PROCEDURE
    name ( [ [ argmode ] [ argname ] argtype [ { DEFAULT | = } default_expr ] [, ...] ] )
  { LANGUAGE lang_name
    | TRANSFORM { FOR TYPE type_name } [, ... ]
    | [ EXTERNAL ] SECURITY INVOKER | [ EXTERNAL ] SECURITY DEFINER
    | SET configuration_parameter { TO value | = value | FROM CURRENT }
    | AS 'definition'
    | AS 'obj_file', 'link_symbol'
} …

As you can see there are a couple of similarities to CREATE FUNCTION so things should be really easy for most end users.

The next example shows a simple procedure:

db11=# CREATE PROCEDURE test_proc()
       LANGUAGE plpgsql
AS $$
  BEGIN
    CREATE TABLE a (aid int);
    CREATE TABLE b (bid int);
    COMMIT;
    CREATE TABLE c (cid int);
    ROLLBACK;
  END;
$$;
CREATE PROCEDURE

The first thing to notice here is that there is a COMMIT inside the procedure. In classical PostgreSQL functions this is not possible for a simple reason. Consider the following code:

SELECT func(id) FROM large_table;

What would happen if some function call simply committed? Total chaos would be the consequence. Therefore real transactions are only possible inside a “procedure”, which is never called the way a function is executed. Also note that there is more than one transaction going on inside our procedure. A procedure is therefore more of a “batch job”.

The following example shows how to call the procedure I have just implemented:

db11=# CALL test_proc();
CALL

The first two tables were committed – the third table has not been created because of the rollback inside the procedure.

db11=# \d
List of relations
 Schema | Name | Type  | Owner
--------+------+-------+-------
 public | a    | table | hs
 public | b    | table | hs

(2 rows)

To me CREATE PROCEDURE is definitely one of the most desirable features of PostgreSQL 11.0. The upcoming release will be great and many people will surely welcome CREATE PROCEDURE the way I do.

The post Tech preview: PostgreSQL 11 – CREATE PROCEDURE appeared first on Cybertec.


Gulcin Yildirim: Near-Zero Downtime Automated Upgrades of PostgreSQL Clusters in Cloud (Part II)


I’ve started to write about the tool (pglupgrade) that I developed to perform near-zero downtime automated upgrades of PostgreSQL clusters. In this post, I’ll be talking about the tool and discuss its design details.

You can check the first part of the series here: Near-Zero Downtime Automated Upgrades of PostgreSQL Clusters in Cloud (Part I).

The tool is written in Ansible. I have prior experience of working with Ansible, and I currently work with it in 2ndQuadrant as well, which is why it was a comfortable option for me. That being said, you can implement the minimal downtime upgrade logic, which will be explained later in this post, with your favorite automation tool.

Further reading: Blog posts Ansible Loves PostgreSQL PostgreSQL Planet in Ansible Galaxy and presentation Managing PostgreSQL with Ansible.

Pglupgrade Playbook

In Ansible, playbooks are the main scripts that are developed to automate the processes such as provisioning cloud instances and upgrading database clusters. Playbooks may contain one or more plays. Playbooks may also contain variables, roles, and handlers if defined.

The tool consists of two main playbooks. The first playbook is provision.yml, which automates the process of creating Linux machines in the cloud according to the specifications (this is an optional playbook written only to provision cloud instances and is not directly related to the upgrade). The second (and the main) playbook is pglupgrade.yml, which automates the upgrade process of database clusters.

The pglupgrade playbook has eight plays to orchestrate the upgrade. Each of the plays uses one configuration file (config.yml) and performs some tasks on the hosts or host groups that are defined in the host inventory file (hosts.ini).

Inventory File

An inventory file lets Ansible know which servers it needs to connect to using SSH, what connection information it requires, and optionally which variables are associated with those servers. Below you can see a sample inventory file that has been used to execute automated cluster upgrades for one of the case studies designed for the tool. We will discuss these case studies in upcoming posts of this series.

[old-primary]
54.171.211.188

[new-primary]
54.246.183.100

[old-standbys]
54.77.249.81
54.154.49.180

[new-standbys:children]
old-standbys

[pgbouncer]
54.154.49.180

Inventory File (hosts.ini)

The sample inventory file contains five hosts under five host groups: old-primary, new-primary, old-standbys, new-standbys and pgbouncer. A server can belong to more than one group. For example, new-standbys is a group containing the old-standbys group, which means that the hosts defined under the old-standbys group (54.77.249.81 and 54.154.49.180) also belong to the new-standbys group. In other words, the old-standbys group is a child of the new-standbys group. This is achieved by using the special :children suffix.

Once the inventory file is ready, the Ansible playbook can be run via the ansible-playbook command, pointing to the inventory file (if it is not in the default location, otherwise the default inventory file is used) as shown below:

$ ansible-playbook -i hosts.ini pglupgrade.yml

Running an Ansible playbook

Configuration File

Pglupgrade playbook uses a configuration file (config.yml) that allows users to specify values for the logical upgrade variables.

As shown below, config.yml mainly stores PostgreSQL-specific variables that are required to set up a PostgreSQL cluster, such as postgres_old_datadir and postgres_new_datadir to store the path of the PostgreSQL data directory for the old and new PostgreSQL versions; postgres_new_confdir to store the path of the PostgreSQL config directory for the new PostgreSQL version; and postgres_old_dsn and postgres_new_dsn to store the connection string for the pglupgrade_user to be able to connect to the pglupgrade_database of the old and the new primary servers. The connection string itself is composed of configurable variables, so that the user (pglupgrade_user) and database (pglupgrade_database) information can be changed for different use cases.

ansible_user: admin

pglupgrade_user: pglupgrade
pglupgrade_pass: pglupgrade123
pglupgrade_database: postgres

replica_user: postgres
replica_pass: ""

pgbouncer_user: pgbouncer

postgres_old_version: 9.5
postgres_new_version: 9.6

subscription_name: upgrade
replication_set: upgrade

initial_standbys: 1

postgres_old_dsn: "dbname={{pglupgrade_database}} host={{groups['old-primary'][0]}} user={{pglupgrade_user}}"
postgres_new_dsn: "dbname={{pglupgrade_database}} host={{groups['new-primary'][0]}} user={{pglupgrade_user}}"

postgres_old_datadir: "/var/lib/postgresql/{{postgres_old_version}}/main" 
postgres_new_datadir: "/var/lib/postgresql/{{postgres_new_version}}/main"

postgres_new_confdir: "/etc/postgresql/{{postgres_new_version}}/main"

Configuration File (config.yml)

As a key step for any upgrade, the PostgreSQL version information can be specified for the current version (postgres_old_version) and the version being upgraded to (postgres_new_version). In contrast to physical replication, which copies the system at the byte/block level, logical replication allows selective replication, copying only the logical data of specified databases and the tables within them. For this reason, config.yml allows configuring which database to replicate via the pglupgrade_database variable. Also, the logical replication user needs replication privileges, which is why the pglupgrade_user variable should be specified in the configuration file. There are other variables related to the internal workings of pglogical, such as subscription_name and replication_set, which are used in the pglogical role.

High Availability Design of the Pglupgrade Tool

The pglupgrade tool is designed to give the user flexibility in terms of High Availability (HA) properties for different system requirements. The initial_standbys variable (see config.yml) is the key to designating the HA properties of the cluster while the upgrade operation is happening.

For example, if initial_standbys is set to 1 (can be set to any number that cluster capacity allows), that means there will be 1 standby created in the upgraded cluster along with the master before the replication starts. In other words, if you have 4 servers and you set initial_standbys to 1, you will have 1 primary and 1 standby server in the upgraded new version, as well as 1 primary and 1 standby server in the old version.

This option allows the existing servers to be reused while the upgrade is still happening. In the example of 4 servers, the old primary and standby servers can be rebuilt as 2 new standby servers after the replication finishes.

When initial_standbys variable is set to 0, there will be no initial standby servers created in the new cluster before the replication starts.

If the initial_standbys configuration sounds confusing, do not worry. This will be explained better in the next blog post when we discuss two different case studies.

Finally, the configuration file allows old and new server groups to be specified. This can be provided in two ways. First, if there is an existing cluster, the IP addresses of the servers (which can be either bare-metal or virtual servers) should be entered into the hosts.ini file, taking into account the desired HA properties during the upgrade operation.

The second way is to run the provision.yml playbook (this is how I provisioned the cloud instances, but you can use your own provisioning scripts or provision instances manually) to provision empty Linux servers in the cloud (AWS EC2 instances) and get their IP addresses into the hosts.ini file. Either way, config.yml gets the host information through the hosts.ini file.

Workflow of the Upgrade Process

After explaining the configuration file (config.yml) which is used by pglupgrade playbook, we can explain the workflow of the upgrade process.

Pglupgrade Workflow

As seen in the diagram above, six server groups are generated at the beginning based on the configuration (both hosts.ini and config.yml). The new-primary and old-primary groups will always have one server, the pgbouncer group can have one or more servers, and all the standby groups can have zero or more servers. Implementation-wise, the whole process is split into eight steps, each corresponding to a play in the pglupgrade playbook that performs the required tasks on the assigned host groups. The upgrade process is explained through the following plays:

  1. Build hosts based on configuration: Preparation play which builds internal groups of servers based on the configuration. The result of this play (in combination with the hosts.ini contents) is the six server groups (illustrated with different colours in the workflow diagram) which will be used by the following seven plays.
  2. Setup new cluster with initial standby(s): Sets up an empty PostgreSQL cluster with the new primary and initial standby(s) (if any are defined). It ensures that there are no leftovers from previous PostgreSQL installations.
  3. Modify the old primary to support logical replication: Installs the pglogical extension, then sets up the publisher by adding all the tables and sequences to the replication set.
  4. Replicate to the new primary: Sets up the subscriber on the new primary, which acts as the trigger to start logical replication. This play finishes replicating the existing data and then keeps catching up with what has changed since the replication started.
  5. Switch the pgbouncer (and applications) to new primary: When the replication lag converges to zero, pauses pgbouncer to switch the applications over gradually. Then it points the pgbouncer config to the new primary and waits until the replication difference reaches zero. Finally, pgbouncer is resumed and all the waiting transactions are propagated to the new primary and processed there. The initial standbys are already in use and serve read requests.
  6. Clean up the replication setup between old primary and new primary: Terminates the connection between the old and the new primary servers. Since all the applications have been moved to the new primary server and the upgrade is done, logical replication is no longer needed. Replication between primary and standby servers continues with physical replication.
  7. Stop the old cluster: The Postgres service is stopped on the old hosts to ensure no application can connect to it anymore.
  8. Reconfigure rest of the standbys for the new primary: Rebuilds the other standbys if there are any remaining hosts besides the initial standbys. In the second case study, there are no remaining standby servers to rebuild. This step gives the chance to rebuild the old primary server as a new standby if it is listed in the new-standbys group in hosts.ini. The re-usability of existing servers (even the old primary) is achieved by the two-step standby configuration design of the pglupgrade tool. The user can specify which servers should become standbys of the new cluster before the upgrade, and which should become standbys after the upgrade.

Conclusion

In this post, we discussed the implementation details and the high availability design of the pglupgrade tool. In doing so, we also mentioned a few key concepts of Ansible development (i.e. playbook, inventory and config files) using the tool as an example. We illustrated the workflow of the upgrade process and summarized how each step works with a corresponding play. We will continue to explain pglupgrade by showing case studies in upcoming posts of this series.

Thanks for reading!

Peter Bengtsson: Best EXPLAIN ANALYZE benchmark script


tl;dr; Use best-explain-analyze.py to benchmark a SQL query in Postgres.

I often benchmark SQL by extracting the relevant SQL string, prefixing it with EXPLAIN ANALYZE, putting it into a file (e.g. benchmark.sql) and then running psql mydatabase < benchmark.sql. That spits out something like this:

                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using main_song_ca949605 on main_song  (cost=0.43..237.62 rows=1 width=4) (actual time=1.586..1.586 rows=0 loops=1)
   Index Cond: (artist_id = 27451)
   Filter: (((name)::text % 'Facing The Abyss'::text) AND (id <> 2856345))
   Rows Removed by Filter: 170
 Planning time: 3.335 ms
 Execution time: 1.701 ms
(6 rows)

Cool. So you study the steps of the query plan and look for "Seq Scan" and various sub-optimal uses of heaps and indices etc. But often, you really want to just look at the Execution time milliseconds number. Especially if you have two slightly different SQL queries to compare and contrast.

However, as you might have noticed, the Execution time number varies between runs. You might think nothing has changed, but Postgres might have warmed up some internal caches, or your host might be more or less busy. To remedy this, you run the EXPLAIN ANALYZE select ... a couple of times to get a feeling for an average. But there's a much better way!

best-explain-analyze.py

Check this out: best-explain-analyze.py

Download it into your ~/bin/ and chmod +x ~/bin/best-explain-analyze.py. I wrote it just this morning so don't judge!

Now, when you run it, it runs the query 10 times (by default) and reports the best Execution time, the mean and the median. Example output:

▶ best-explain-analyze.py songsearch dummy.sql
EXECUTION TIME
    BEST    1.229ms
    MEAN    1.489ms
    MEDIAN  1.409ms
PLANNING TIME
    BEST    1.994ms
    MEAN    4.557ms
    MEDIAN  2.292ms

The "BEST" is an important metric. More important than mean or median.

Raymond Hettinger explains it better than I do. His context is for benchmarking Python code but it's equally applicable:

"Use the min() rather than the average of the timings. That is a recommendation from me, from Tim Peters, and from Guido van Rossum. The fastest time represents the best an algorithm can perform when the caches are loaded and the system isn't busy with other tasks. All the timings are noisy -- the fastest time is the least noisy. It is easy to show that the fastest timings are the most reproducible and therefore the most useful when timing two different implementations."

Don Seiler: Beware (Sort-Of) Ambiguous Column Names In Sub-Selects

This morning I was testing an UPDATE statement I received from a developer. It ran without errors, but then I saw that it updated 5 rows when it should have updated only 3. The reason gave me a little shock, so I whipped up a simple test case to reproduce the problem.

First we create two tables:

CREATE TABLE foo (
    id int
    , name varchar(30)
);
CREATE TABLE
CREATE TABLE bar (
    id int
    , foo_id int
    , description varchar(100)
);
CREATE TABLE

Then we insert some data:

INSERT INTO foo (id, name) VALUES
    (1, 'Dev')
    , (2, 'QA')
    , (3, 'Preprod')
    , (4, 'Prod');
INSERT 0 4
INSERT INTO bar (id, foo_id, description)
    VALUES (1, 1, 'A')
    , (2, 2, 'B')
    , (3, 2, 'C')
    , (4, 2, 'D')
    , (5, 3, 'E');
INSERT 0 5

Here I'm using a SELECT rather than the original UPDATE just to test. This could (should) be done as a join, but I was sent something like this:

SELECT COUNT(*)
FROM bar
WHERE foo_id = (SELECT id FROM foo WHERE name='QA');
 count
-------
     3
(1 row)

Fair enough. It does the same thing as a join. However what I was sent was actually this (note the column name in the subquery):

SELECT COUNT(*)
FROM bar
WHERE foo_id = (SELECT foo_id FROM foo WHERE name='QA');
 count
-------
     5
(1 row)

I would expect an error since foo_id does not exist in table foo, like this:

SELECT foo_id FROM foo WHERE name='QA';
ERROR:  42703: column "foo_id" does not exist
LINE 1: SELECT foo_id FROM foo WHERE name='QA';

Instead, it basically selected EVERYTHING in the bar table. Why?

I posted this dilemma in the PostgreSQL Slack channel, and others were similarly surprised by this. Ryan Guill tested and confirmed the same behavior not only in Postgres, but also in Oracle, MS SQL, & MySQL.

Cindy Wise observed that it is probably using the foo_id field from the bar table in the outer query, which does make sense. It's comparing foo_id to itself (while also running the now-pointless subquery against the foo table), which is of course true for every non-NULL value, so it grabs every row in the bar table.

This seems like a very easy trap to fall into if you're not careful with your column names. Considering this was originally in the form of an UPDATE query, it can be a destructive mistake that would execute successfully and could be rather hard to trace back.
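One defensive habit that would have surfaced the problem immediately (a sketch based on the tables above): qualify the column in the subquery, which turns the silent outer reference into the error you expected.

SELECT COUNT(*)
FROM bar
WHERE foo_id = (SELECT foo.foo_id FROM foo WHERE foo.name = 'QA');
-- ERROR:  column foo.foo_id does not exist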

Pavel Stehule: new presentation - plpgsql often issues

Hubert 'depesz' Lubaczewski: Waiting for PostgreSQL 11 – Support partition pruning at execution time

On 7th of April 2018, Alvaro Herrera committed patch: Support partition pruning at execution time Existing partition pruning is only able to work at plan time, for query quals that appear in the parsed query. This is good but limiting, as there can be parameters that appear later that can be usefully used to further […]
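A rough sketch of what execution-time pruning enables (my own example, not from the commit): with a prepared statement and a generic plan, the matching partition is only known at EXECUTE time, and the pruned partitions show up in EXPLAIN (ANALYZE) as "(never executed)".

CREATE TABLE measurements (ts date, val int) PARTITION BY RANGE (ts);
CREATE TABLE measurements_2017 PARTITION OF measurements
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');
CREATE TABLE measurements_2018 PARTITION OF measurements
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');

PREPARE q (date) AS SELECT count(*) FROM measurements WHERE ts = $1;
EXPLAIN (ANALYZE) EXECUTE q ('2018-06-01');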