Channel: Planet PostgreSQL

Robert Haas: PostgreSQL Regression Test Coverage

Yesterday evening, I ran the PostgreSQL regression tests (make check-world) on master and on each supported back-branch three times on hydra, a community test machine provided by IBM.  Here are the median results:

9.1 - 3m49.942s
9.2 - 5m17.278s
9.3 - 6m36.609s
9.4 - 9m48.211s
9.5 - 8m58.544s
master, or 9.6 - 13m16.762s

Markus Winand: filter — Selective Aggregates


The filter clause extends aggregate functions (sum, avg, count, …) by an additional where clause. The result of the aggregate is built from only the rows that satisfy the additional where clause too.

Syntax

The filter clause follows an aggregate function:

SUM(<expression>) FILTER(WHERE <condition>)

With the exception of subqueries and window functions, the <condition> may contain any expression that is allowed in regular where clauses.

The filter clause works for any aggregate function: besides the well-known functions such as sum and count, it also works for array_agg and sorted set functions (e.g., percentile_cont).
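As a quick illustration (the products table and its columns here are made up for this example, not part of the original article), filter lets you compute several conditional aggregates in a single pass over a table:

SELECT COUNT(*)                              AS products_total,
       COUNT(*) FILTER (WHERE price > 100)   AS expensive_products,
       COUNT(*) FILTER (WHERE price IS NULL) AS unpriced_products
  FROM products;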

If an aggregate function is used as a window function (over clause), the syntactic order is: aggregate function, filter clause, over clause:

SUM(...) FILTER(WHERE ...) OVER (...)

However, the filter clause is not generally allowed before over: it is only allowed after aggregate functions, not after other window functions. It is not allowed before ranking functions (rank, dense_rank, etc.), for example.

Use Cases

The following articles describe common use cases of filter:

Compatibility

SQL:2003 introduced the filter clause as part of the optional feature “Advanced OLAP operations” (T612). It is barely supported today, but is easy to emulate using case (see Conforming Alternatives).

Availability of FILTER

Conforming Alternatives

Generally, the filter clause can be implemented as a case expression inside the aggregate function: the filter condition has to be put into the when-clause, the value to be aggregated into the then clause. Because aggregate functions generally skip over null values, the implicit else null clause is enough to ignore non-matching rows. The following two expressions are equivalent:

SUM(<expression>) FILTER(WHERE <condition>)
SUM(CASE WHEN <condition> THEN <expression> END)

Count(*) needs some special treatment because “*” cannot be put into the then clause. Instead, it is enough to use a non-null constant value. This ensures that every matching row is counted. The implicit else null clause maps non-matching rows to null, which is ignored by count too.

COUNT(*) FILTER (WHERE <condition>)
COUNT(CASE WHEN <condition> THEN 1 END)

When using a set quantifier (distinct or all), it must remain inside the aggregate function, before the case expression.
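As a small sketch of that rule (the orders table is hypothetical, not from the article), the distinct stays inside the aggregate and, in the emulation, wraps the whole case expression:

SELECT COUNT(DISTINCT customer_id) FILTER (WHERE status = 'paid')     AS paying_customers,
       COUNT(DISTINCT CASE WHEN status = 'paid' THEN customer_id END) AS paying_customers_case
  FROM orders;

Both expressions count each paying customer only once.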

Proprietary Extensions

PostgreSQL: Subqueries Allowed

The PostgreSQL database supports subqueries inside the filter clause (e.g., via exists).
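A hedged sketch of what that can look like (the orders and vip_customers tables are hypothetical and only serve to illustrate the extension):

SELECT COUNT(*) AS all_orders,
       COUNT(*) FILTER (WHERE EXISTS (SELECT 1
                                        FROM vip_customers v
                                       WHERE v.customer_id = o.customer_id)) AS vip_orders
  FROM orders o;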

“filter — Selective Aggregates” by Markus Winand was originally published at modern SQL.

Keith Fiske: Checking for PostgreSQL Bloat


My post almost 2 years ago about checking for PostgreSQL bloat is still one of the most popular ones on my blog (according to Google Analytics anyway). Since that’s the case, I’ve gone and changed the URL to my old post and reused that one for this post. I’d rather people be directed to correct and current information as quickly as possible instead of adding an update to my old post pointing to a new one. I’ve included my summary on just what exactly bloat is again below since that seemed to be the most popular part.

The intent of the original post was to discuss a python script I’d written for monitoring bloat status: pg_bloat_check.py. Since that time, I’ve been noticing that the query used in v1.x of that script (obtained from the check_postgres.pl module) was not always accurate and was often not reporting on bloat that I knew for a fact was there (Ex: I just deleted over 300 million rows, vacuumed & analyzed the table and still no bloat? Sure it could happen, but highly unlikely). So I continued looking around and discovered the pgstattuple contrib module that comes with PostgreSQL. After discussing it with several of the core developers at recent PostgreSQL conferences (PGConf US & PGCon), I believe this is a much, much better way to get an accurate assessment of the bloat situation. This encouraged me to do a near complete rewrite of my script and v2.0.0 is now available. It’s not a drop-in replacement for v1.x, so please check the --help for new options.

pgstattuple is a very simple, but powerful extension. It doesn’t require any additional libraries to be loaded and just adds a few functions you can call on database objects to get some statistics about them. The key function for bloat checking is the default one, pgstattuple(regclass), which returns information about live & dead tuples and free space contained in the given object. If you read the description below on what bloat actually is, you’ll see that those data points are exactly what we’re looking for. The difference between what this function is doing and what the check_postgres.pl query is doing is quite significant, though. The check_postgres query is doing its best to guess what is dead & free space based on the current statistics in the system catalogs. pgstattuple actually goes through and does a full scan on the given table or index to see what the actual situation is. This does mean the check can be very, very slow on large tables. The database I got the examples below from is 1.2TB and a full bloat check on it takes just under 1 hour. But with the inaccuracies I’ve seen being returned by the simpler query, this time can be well worth it. The script stores the statistics gathered in a table so they can be easily reviewed at any time and even used for monitoring purposes, just like check_postgres.
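If you want to poke at the raw numbers yourself rather than go through the script, a minimal sketch looks like this (run as a superuser, and using the table from the examples below):

CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('public.group_members');

The dead_tuple_len, free_space and free_percent columns in its output roughly correspond to the dead tuple and reusable space figures shown in the script’s report further down.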

Before showing what the script can do, I just want to re-iterate some things from my old post because they’re important. Bloat percentage alone is a poor indicator of actual system health. Small tables may always have a higher than average bloat, or there may always be 1 or 2 pages considered waste, and in reality that has next to zero impact on database performance. Constantly “debloating” them is more a waste of time than the space used. So the script has some filters for object size, wasted space and wasted percentage. This allows the final output of the bloat report to provide a more accurate representation of where there may actually be problems that need to be looked into.

Another option is a filter for individual tables or indexes to be ignored. If you understand why bloat happens, you will come across cases where a table is stuck at a certain bloat point at all times, no matter how many times you VACUUM FULL it or run pg_repack on it (those two things do remove it, but it quickly comes back). This happens with tables that have a specific level of churn with the rows being inserted, updated & deleted. The number of rows being updated/deleted is balanced with the number of rows being inserted/updated as well as the autovacuum schedule to mark space for reuse. Removing the bloat from tables like this can actually cause decreased performance because instead of re-using the space that VACUUM marks as available, Postgres has to again allocate more pages to that object from disk first before the data can be added. So bloat is actually not always a bad thing and the nature of MVCC can lead to improved write performance on some tables. On to the new script!

So as an example of why this new, slower method can be worth it, here’s the bloat report for a table and its indexes from the old script using check_postgres:

Old Table Bloat:
2. public.group_members.........................................................(9.6%) 4158 MB wasted

Old Index Bloat:
1. public.group_members_id_pk..................................................(19.5%) 4753 MB wasted
3. public.group_members_user_id_idx.............................................(9.6%) 2085 MB wasted
5. public.group_members_deleted_at_idx..........................................(6.2%) 1305 MB wasted

Here are the results from the statistics table in the new version:

$ pg_bloat_check.py -c "dbname=prod" -t public.group_members

kfiske@prod=# select objectname, pg_size_pretty(size_bytes) as object_size, pg_size_pretty(free_space_bytes) as reusable_space, pg_size_pretty(dead_tuple_size_bytes) dead_tuple_space, free_percent from bloat_stats ;
            objectname             | object_size | reusable_space | dead_tuple_space | free_percent
-----------------------------------+-------------+----------------+------------------+--------------
 group_members                     | 42 GB       | 16 GB          | 4209 kB          |        37.84
 group_members_user_id_idx         | 21 GB       | 14 GB          | 1130 kB          |        64.79
 group_members_id_pk               | 24 GB       | 16 GB          | 4317 kB          |        68.96
 group_members_deleted_at_idx      | 20 GB       | 13 GB          | 3025 kB          |        63.77
 group_members_group_id_user_id_un | 11 GB       | 4356 MB        | 6576 bytes       |        38.06
 group_members_group_id_idx        | 17 GB       | 9951 MB        | 0 bytes          |         56.8
 group_members_updated_at_idx      | 15 GB       | 7424 MB        | 0 bytes          |        49.57

Yes, all those indexes did exist before. The old query just didn’t think they had any bloat at all. There’s also a nearly 4x difference in wasted space in the table alone. It’s only 37% of the table in this case, but if you’re trying to clean up bloat due to low disk space, 12GB can be a lot. Another really nice thing pgstattuple provides is a distinction between dead tuples and reusable (free) space. You can see the dead tuple space is quite low in this example. That means autovacuum is running efficiently on this table and marking dead rows from updates & deletes as re-usable. If you see that dead tuple space is high, that could indicate autovacuum is not running properly and you may need to adjust some of the vacuum tuning parameters that are available. In this case, even a normal vacuum was not freeing the reusable space back to the operating system. See below for why this is. This means either a VACUUM FULL or pg_repack run is required to reclaim it. Here’s the result from making a new index on user_id:

kfiske@prod=# CREATE INDEX concurrently ON group_members USING btree (user_id);
CREATE INDEX
Time: 5308849.412 ms

$ pg_bloat_check.py -c "dbname=prod" -t public.group_members

kfiske@prod=# select objectname, pg_size_pretty(size_bytes) as object_size, pg_size_pretty(free_space_bytes) as reusable_space, pg_size_pretty(dead_tuple_size_bytes) dead_tuple_space, free_percent from bloat_stats ;
            objectname             | object_size | reusable_space | dead_tuple_space | free_percent 
-----------------------------------+-------------+----------------+------------------+--------------
 group_members                     | 42 GB       | 16 GB          | 2954 kB          |        37.84
 group_members_user_id_idx         | 21 GB       | 14 GB          | 1168 kB          |        64.79
 group_members_id_pk               | 24 GB       | 16 GB          | 4317 kB          |        68.96
 group_members_deleted_at_idx      | 20 GB       | 13 GB          | 3025 kB          |        63.77
 group_members_group_id_user_id_un | 11 GB       | 4356 MB        | 6784 bytes       |        38.06
 group_members_group_id_idx        | 17 GB       | 9951 MB        | 0 bytes          |         56.8
 group_members_updated_at_idx      | 15 GB       | 7424 MB        | 0 bytes          |        49.57
 group_members_user_id_idx1        | 8319 MB     | 817 MB         | 336 bytes        |         9.83

You can see the new index group_members_user_id_idx1 is now down to only 9% wasted space and much smaller. Here’s the result after running pg_repack to clear both the table and all index bloat:

kfiske@prod=# select objectname, pg_size_pretty(size_bytes) as object_size, pg_size_pretty(free_space_bytes) as reusable_space, pg_size_pretty(dead_tuple_size_bytes) dead_tuple_space, free_percent from bloat_stats ;
            objectname             | object_size | reusable_space | dead_tuple_space | free_percent 
-----------------------------------+-------------+----------------+------------------+--------------
 group_members                     | 25 GB       | 27 MB          | 79 kB            |          0.1
 group_members_id_pk               | 8319 MB     | 818 MB         | 0 bytes          |         9.83
 group_members_user_id_idx         | 8319 MB     | 818 MB         | 0 bytes          |         9.83
 group_members_deleted_at_idx      | 8319 MB     | 818 MB         | 0 bytes          |         9.83
 group_members_group_id_user_id_un | 7818 MB     | 768 MB         | 0 bytes          |         9.83
 group_members_group_id_idx        | 8319 MB     | 818 MB         | 0 bytes          |         9.83
 group_members_updated_at_idx      | 8318 MB     | 818 MB         | 0 bytes          |         9.83
(7 rows)

PostgreSQL 9.5 introduced the pgstattuple_approx(regclass) function, which tries to take advantage of visibility map statistics to speed up gathering tuple statistics, possibly sacrificing some accuracy since it’s not hitting each individual tuple. It only works on tables, though. This option is available with the script using the --quick argument. There’s also pgstatindex(regclass), which gives some more details on index pages and how the data in them is laid out, but I haven’t found a use for that in the script yet.
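A quick sketch of both, again using objects from this post (pgstattuple_approx requires 9.5 or later; column lists are trimmed for readability):

SELECT approx_free_percent, dead_tuple_percent
  FROM pgstattuple_approx('public.group_members');

SELECT avg_leaf_density, leaf_fragmentation
  FROM pgstatindex('public.group_members_id_pk');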

The same output options the old script had are still available: --simple to provide a text summary useful for emails, and --dict, a python dictionary that provides structured output and also greater detail on the raw statistics (basically just the data straight from the table). The table inside the database provides a new, easy method for reviewing the bloat information as well, but just be aware it is rebuilt from scratch every time the script runs. There’s also a new option which I used above (-t, --tablename) that you can use to get the bloat information on just a single table. See the --help for more information on all the options that are available.

Why Bloat Happens

For those of you who are newer to PostgreSQL administration and may be hearing about bloat for the first time, I figured I’d take the time to explain why this scenario exists and why tools like this are necessary (until they’re hopefully built into the database itself someday). It’s something most don’t understand unless someone first explains it to them, or until they run into the headaches it causes when it’s not monitored and learn about it the hard way.

MVCC (multi-version concurrency control) is how Postgres has chosen to deal with multiple transactions/sessions hitting the same rows at (nearly) the same time. The documentation, along with Wikipedia, provides excellent and extensive explanations of how it all works, so I refer you there for all the details. Bloat is a result of one particular part of MVCC, concentrated around the handling of updates and deletes.

Whenever you delete a row, it’s not actually deleted, it is only marked as unavailable to all future transactions taking place after the delete occurs. The same happens with an update: the old version of a row is kept active until all currently running transactions have finished, then it is marked as unavailable. I emphasize the word unavailable because the row still exists on disk, it’s just not visible any longer. The VACUUM process in Postgres then comes along and marks any unavailable rows as space that is now available for future inserts or updates. The auto-vacuum process is configured to run VACUUM automatically after so many writes to a table (follow the link for the configuration options), so it’s not something you typically have to worry about doing manually very often (at least with more modern versions of Postgres).
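To put a rough number on “so many writes”: with the default settings (which can be changed globally or per table), autovacuum triggers a VACUUM once a table’s dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples, which with the defaults of 50 and 0.2 works out to 50 + 0.2 * 1,000,000 = 200,050 dead rows for a million-row table.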

People often assume that VACUUM is the process that should return the disk space to the file system. It does do this but only in very specific cases. That used space is contained in page files that make up the tables and indexes (called objects from now on) in the Postgres database system. Page files all have the same size and differently sized objects just have as many page files as they need. If VACUUM happens to mark every row in a page file as unavailable AND that page also happens to be the final page for the entire object, THEN the disk space is returned to the file system. If there is a single available row, or the page file is any other but the last one, the disk space is never returned by a normal VACUUM. This is bloat. Hopefully this explanation of what bloat actually is shows you how it can sometimes be advantageous for certain usage patterns of tables as well, and why I’ve included the option to ignore objects in the report.

If you give the VACUUM command the special flag FULL, then all of that reusable space is returned to the file system. But VACUUM FULL does this by completely rewriting the entire table (and all its indexes) to new pages and takes an exclusive lock on the table the entire time it takes to run (CLUSTER does the same thing, but what that does is outside the scope of this post). For large tables in frequent use, this is problematic. pg_repack has been the most common tool we’ve used to get around that. It recreates the table in the background, tracking changes to it, and then takes a brief lock to swap the old bloated table with the new one.

Why bloat is actually a problem when it gets out of hand is not just the disk space it uses up. Every time a query is run against a table, the visibility flags on individual rows and index entries are checked to see if they are actually available to that transaction. On large tables (or small tables with a lot of bloat) the time spent checking those flags builds up. This is especially noticeable with indexes, where you expect an index scan to improve your query performance and it seems to be making no difference or is actually worse than a sequential scan of the whole table. This is why index bloat is checked independently of table bloat, since a table could have little to no bloat, but one or more of its indexes could be badly bloated. Index bloat (as long as it’s not a primary key) is easier to solve because you can either just reindex that one index, or you can concurrently create a new index on the same column and then drop the old one when it’s done.
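For the non-primary-key case, here is a sketch of that concurrent rebuild using one of the indexes from this post (the temporary index name is just illustrative):

CREATE INDEX CONCURRENTLY group_members_user_id_idx_new
    ON group_members USING btree (user_id);
DROP INDEX CONCURRENTLY group_members_user_id_idx;
ALTER INDEX group_members_user_id_idx_new RENAME TO group_members_user_id_idx;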

In all cases when you run VACUUM, it’s a good idea to run ANALYZE as well, either at the same time in one command or as two separate commands. This updates the internal statistics that Postgres uses when creating query plans. The number of live and dead rows in a table/index is a part of how Postgres decides to plan and run your queries. It’s a much smaller part of the plan than other statistics, but every little bit can help.

I hope this explanation of what bloat is, and how this tool can help with your database administration, has been helpful.

Shaun M. Thomas: PG Phriday: Converting to Horizontal Distribution


Now that we’ve decided to really start embracing horizontal scaling builds, there is a critically important engine-agnostic element we need to examine. Given an existing table, how exactly should we split up the contents across our various nodes during the conversion process? Generally this is done by selecting a specific column and applying some kind of hash or custom distribution mechanism to ensure all node contents are reasonably balanced. But how do we go about figuring that out?

This question is usually answered with “use the primary key!” But this gets a bit more complex in cases where tables rely on composite keys. This doesn’t happen often, but can really throw a wrench into the works. Imagine for example, we’re using Postgres-XL and have four nodes numbered data0000 through data0003. Then we find this table:

CREATE TABLE comp_event
(
  group_code   TEXT NOT NULL,
  event_id     BIGINT NOT NULL,
  entry_tm     TIMETZ NOT NULL,
  some_data    TEXT NOT NULL
);

INSERT INTO comp_event
SELECT a.id % 10, a.id % 100000,
       '08:00'::TIMETZ + (a.id % 43200 || 's')::INTERVAL,
       repeat('a', a.id % 10)
  FROM generate_series(1, 1000000) a (id);

ALTER TABLE comp_event ADD CONSTRAINT pk_comp_event
      PRIMARY KEY (group_code, event_id, entry_tm);

ANALYZE comp_event;

The default for Postgres-XL is to simply use the first column for distribution. This tends to fit most cases, as the first column is usually either the primary key, or a reasonable facsimile of it. We can even use a system view to confirm this is the case:

SELECT pcrelid::regclass AS table_name, a.attname AS column_name
  FROM pgxc_class c
  JOIN pg_attribute a ON (a.attrelid = c.pcrelid)
 WHERE a.attnum = c.pcattnum;

 table_name | column_name 
------------+-------------
 comp_event | group_code

But is this what we want? What would happen if we naively went ahead with the default value and converted the database? Well, the major problem is that we don’t know the hash algorithm Postgres-XL is using. It’s entirely possible that the resulting data distribution will be anywhere from “slightly off” to “completely awful,” and we need a way to verify uniform distribution before moving forward.

In the case of Postgres-XL, we can actually poll each node directly with EXECUTE DIRECT. Repeatedly executing the same query and just substituting the node name is both inefficient and cumbersome, especially if we have dozens or hundreds of nodes. Thankfully Postgres makes it easy to create functions that return sets, so let’s leverage that power in our favor:

CREATE TYPE pgxl_row_dist AS (node_name TEXT, total BIGINT);

CREATE OR REPLACE FUNCTION check_row_counts(tab_name REGCLASS)
RETURNS SETOF pgxl_row_dist AS
$BODY$
DECLARE
  r pgxl_row_dist;
  query TEXT;
BEGIN
  FOR r.node_name IN SELECT node_name
        FROM pgxc_node WHERE node_type = 'D'
  LOOP
    query = 'EXECUTE DIRECT ON (' || r.node_name || ') 
      ''SELECT count(*) FROM ' || tab_name::TEXT || '''';
    EXECUTE query INTO r.total;
    RETURN NEXT r;
  END LOOP;
END;
$BODY$ LANGUAGE plpgsql;

A function like this really should exist in some form in the standard Postgres-XL distribution; unfortunately, if it does, I couldn’t find it. Regardless, with this in hand, we can provide a table name and see how many rows exist on each node, no matter our cluster size. For our four node cluster, each node should have about 250,000 rows, give or take some variance caused by the hashing algorithm. Let’s see what the distribution actually resembles:

SELECT * FROM check_row_counts('comp_event');

 node_name | total  
-----------+--------
 data0000  | 600000
 data0001  | 200000
 data0002  | 200000
 data0003  |      0

That’s… unfortunate. The table doesn’t list its columns in order of cardinality, since that’s never been a concern before now. Beyond that, the first column is part of our primary key, so it makes sense for it to be listed near the top anyway. Position is hardly a reliable criterion beyond a first approximation, so how do we fix this?

Let’s examine the Postgres statistics catalog for the comp_event table, and see how cardinality is actually represented:

SELECT attname, n_distinct
  FROM pg_stats
 WHERE tablename = 'comp_event';

  attname   | n_distinct 
------------+------------
 group_code |         10
 event_id   |    12471.5
 entry_tm   |  -0.158365
 some_data  |         10

The sample insert statement we used to fill comp_event should have already made this clear, but in practice we won’t have a tidy example to read the distribution from. If we assume the table already existed, or we loaded it from multiple sources or scripts, the statistics would be our primary guide.

In this particular case, the event_id or entry_tm columns would be much better candidates to achieve balanced distribution. For now, let’s just keep things simple and use the event_id column since the primary difference is the cardinality. There’s no reason to introduce multiple variables such as column type quite yet.

Let’s check our row totals after telling Postgres-XL we want to use event_id for hashing:

TRUNCATE TABLE comp_event;
ALTER TABLE comp_event DISTRIBUTE BY HASH (event_id);

INSERT INTO comp_event
SELECT a.id % 10, a.id % 100000,
       '08:00'::TIMETZ + (a.id % 43200 || 's')::INTERVAL,
       repeat('a', a.id % 10)
  FROM generate_series(1, 1000000) a (id);

SELECT * FROM check_row_counts('comp_event');

 node_name | total  
-----------+--------
 data0000  | 250050
 data0001  | 249020
 data0002  | 249730
 data0003  | 251200

Much better! Now our queries will retrieve data from all four nodes, and the first node isn’t working three times harder than the others. If we had gone into production using the previous distribution, our cluster would be unbalanced and we’d be chasing performance problems. Or if we figured this out too late, we’d have to rebalance all of the data, which can take hours or even days depending on row count. No thanks!

It’s important to do this kind of analysis before moving data into a horizontally capable cluster. The Postgres pg_stats table makes that easy to accomplish. And if repeating this process for every table is too irritating, we can even do it in bulk. Let’s construct an unholy abomination that returns the primary key column with the highest cardinality for all tables:

SELECT DISTINCT ON (schemaname, tablename)
       schemaname, tablename, attname
  FROM (SELECT s.schemaname, c.relname AS tablename,
               a.attname, i.indisprimary, i.indisunique,
               SUM(s.n_distinct) AS total_values
          FROM pg_index i
          JOIN pg_attribute a ON (
                   a.attrelid = i.indrelid AND
                   a.attnum = ANY (i.indkey))
          JOIN pg_class c ON (c.oid = i.indrelid)
          JOIN pg_namespace n ON (n.oid = c.relnamespace)
          JOIN pg_stats s ON (
                   s.schemaname = n.nspname AND
                   s.tablename = c.relname AND
                   s.attname = a.attname
               )
         WHERE i.indisunique
           AND s.schemaname NOT IN ('pg_catalog', 'information_schema')
         GROUP BY 1, 2, 3, 4, 5) cols
 ORDER BY schemaname, tablename,
       CASE WHEN total_values < 0 THEN -total_values * 9e20
            ELSE total_values END DESC,
       indisprimary DESC, indisunique DESC;

Gross! But at least we only have to do that once or twice before restoring all of our data in the new horizontally scaled cluster. We could even make the query uglier and have it generate our ALTER TABLE statements so we don’t need to manually correct the distribution of every table. And don’t forget that this process applies to nearly all distribution mechanisms which depend on column contents, not just Postgres-XL. Just do your due diligence, and everything should work out.
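As a rough sketch of that idea (assuming the query above has been saved as a view named candidate_dist_columns, which is purely illustrative and not part of the original post), the statements could be generated with format():

SELECT format('ALTER TABLE %I.%I DISTRIBUTE BY HASH (%I);',
              schemaname, tablename, attname)
  FROM candidate_dist_columns;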

Happy scaling!

Oleg Bartunov: Slides from PGCon-2016

Rubens Souza: JSONB and PostgreSQL 9.5: with even more powerful tools!


PostgreSQL 9.5 has introduced new JSONB functionalities, greatly improving its already present NoSQL characteristics. With the inclusion of new operators and functions, now it is possible to easily modify the JSONB data. In this article these new modifiers will be presented together with some examples of how to use them.

With the inclusion of the JSON data type in its 9.2 release, PostgreSQL finally started supporting JSON natively. Although with this release it was possible to use Postgres as a “NoSQL” database, not much could actually be done at the time due to the lack of operators and interesting functions. Since 9.2, JSON support has been improving significantly in each new version of PostgreSQL, to the point that those initial limitations have now been completely overcome.


Probably, the most remarkable improvements were the addition of the JSONB data type in Postgres 9.4 and, in the current Postgres 9.5 release, the introduction of new operators and functions that permit you to modify and manipulate JSONB data.

In this article we will focus on the new capabilities brought by Postgres 9.5. However, before diving into that, if you want to know more about the differences between the JSON and JSONB data types, or if you have doubts about whether a “NoSQL” database is a good solution for your use case (which you should ;)), I suggest you read the previous articles we have written regarding these topics:

The new JSONB Operators

The operators and functions present in PostgreSQL until 9.4 only made it possible to extract JSONB data. Therefore, to actually modify this data, one would extract it, modify it, and then reinsert the data. Not too practical, some would say.

The new operators included in PostgreSQL 9.5, which were based on the jsonbx extension for PostgreSQL 9.4, have changed this, greatly improving how JSONB data can be handled.

Concatenate with ||

You can now concatenate two JSONB objects using the || operator:

SELECT '{"name": "Marie", "age": 45}'::jsonb || '{"city": "Paris"}'::jsonb;

                      ?column?
   ----------------------------------------------
    {"age": 45, "name": "Marie", "city": "Paris"}
   (1 row)

In the example above, the key city is appended to the first JSONB object.

It can also be used to overwrite already existing values:

SELECT '{"city": "Niceland", "population": 1000}'::jsonb || '{"population": 9999}'::jsonb;

                    ?column?
   -------------------------------------------
    {"city": "Niceland", "population": 9999}
   (1 row)

In this case, the value of the key population was overwritten by the value of the second object.

Delete with -

The - operator can remove a key/value pair from a JSONB object:

SELECT '{"name": "Karina", "email": "karina@localhost"}'::jsonb - 'email';

         ?column?
    -------------------
     {"name": "Karina"}
    (1 row)

As you can see, the key email specified by the operator - was removed from the object.

It is also possible to remove an element from an array:

SELECT '["animal","plant","mineral"]'::jsonb - 1;

         ?column?
   -----------------------
    ["animal", "mineral"]
   (1 row)

The example above shows an array containing 3 elements. Knowing that the first element in an array corresponds to the position 0 ( animal ), the - operator specifies the element at position 1 to be removed, and consequently removes plant from the array.

Delete with #-

The difference in comparing against the - operator is that with the #- operator, a nested key/value pair can be removed, if the path to be followed is provided:

SELECT '{"name": "Claudia", "contact": {"phone": "555-5555", "fax": "111-1111"}}'::jsonb #- '{contact,fax}'::text[];

                           ?column?
   ---------------------------------------------------------
    {"name": "Claudia", "contact": {"phone": "555-5555"}}
   (1 row)

Here, the fax key is nested within contact. We use the #- operator to indicate the path to the fax key in order to remove it.

The new JSONB functions

For more data processing power to edit JSONB data instead of only deleting or overwriting it, we can now use the new JSONB function:

jsonb_set

The new jsonb_set processing function allows you to update the value for a specific key:

SELECT
    jsonb_set(
        '{"name": "Mary", "contact": {"phone": "555-5555", "fax": "111-1111"}}'::jsonb,
        '{contact,phone}',
        '"000-8888"'::jsonb,
        false);

                                jsonb_set
   ------------------------------------------------------------------------
    {"name": "Mary", "contact": {"fax": "111-1111", "phone": "000-8888"}}
   (1 row)

It is easier to understand the above example knowing the structure of the jsonb_set function. It has 4 arguments:

  • target jsonb: The JSONB value to be modified
  • path text[]: The path to the value target to be changed, represented as a text array
  • new_value jsonb: The new value for the key at the given path (added or changed)
  • create_missing boolean: An optional field that allows the creation of the new key/value if it doesn’t yet exist (defaults to true)

Looking back at the previous example, now understanding its structure, we can see that the nested phone key within contact has been changed by the jsonb_set.

Here is one more example, now creating a new key through the use of the true boolean parameter (4th argument of the jsonb_set structure). As mentioned before, this argument defaults to true, so strictly speaking it would not need to be declared; it is spelled out in the next example for clarity:

SELECT
    jsonb_set(
        '{"name": "Mary", "contact": {"phone": "555-5555", "fax": "111-1111"}}'::jsonb,
        '{contact,skype}',
        '"maryskype"'::jsonb,
        true);

                                                 jsonb_set
   ------------------------------------------------------------------------------------------------------
    {"name": "Mary", "contact": {"fax": "111-1111", "phone": "555-5555", "skype": "maryskype"}}
   (1 row)

The skype key/value pair, which wasn’t present in the original JSONB object, was added and is nested within contact according to the path specified in the 2nd argument of the jsonb_set structure.

If, instead of true on the 4th argument of jsonb_set, we have set it to false, the skype key wouldn’t be added to the JSONB object.
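To illustrate that last point (this example is mine, not from the original article), the same call with false and a path that doesn’t exist simply returns the target value unchanged, with no skype key added:

SELECT
    jsonb_set(
        '{"name": "Mary", "contact": {"phone": "555-5555", "fax": "111-1111"}}'::jsonb,
        '{contact,skype}',
        '"maryskype"'::jsonb,
        false);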

jsonb_pretty

Reading a JSONB entry is not that easy, considering that it doesn’t preserve whitespace. The jsonb_pretty function formats the output, making it easier to read:

SELECT
    jsonb_pretty(
        jsonb_set(
            '{"name": "Joan", "contact": {"phone": "555-5555", "fax": "111-1111"}}'::jsonb,
            '{contact,phone}',
            '"000-1234"'::jsonb));

             jsonb_pretty
   ---------------------------------
    {                              +
        "name": "Joan",            +
        "contact": {               +
            "fax": "111-1111",     +
            "phone": "000-1234"    +
        }                          +
    }
   (1 row)

Again, in this example, the value of the nested phone key within contact is changed to the value given in the 3rd argument of the jsonb_set function. The only difference is that, since we used it together with the jsonb_pretty function, the output is shown in a clearer, more readable way.

Conclusion

Contrary to what the momentary hype on “NoSQL” databases is trying to show, a non-relational database cannot be seen as a “one size fits all” solution and, certainly, won’t be everyone’s favourite cup of tea.

Because of this, when talking about “NoSQL” databases, one thing to keep in mind is whether a document database will fit your use case better than a relational one. If you conclude that this is the case, a great advantage is brought by the PostgreSQL JSONB features: you can have both options (a document and a relational database) delivered by the same solution, avoiding all the complexity that using different products would bring.

Nikolay Shaplov: postgres: setting reloptions to TOAST that does not exist

When you create a table in Postgres, you may actually be creating up to two relations.
If the table has only fixed-length attributes, only one relation is created: the heap relation.
If the table has at least one variable-length attribute, then both heap and toast relations will be created.

Relations also have options: reloptions. You can set them when creating or altering a table. To set options for the toast relation, use the toast. prefix before the reloption name:
CREATE TABLE reloptions_test (s varchar) WITH (toast.autovacuum_vacuum_cost_delay = 23 );

The only problem is that if your table has no variable-length values, Postgres will accept the toast reloption but will not write it anywhere.
# CREATE TABLE reloptions_test (i int) WITH (toast.autovacuum_vacuum_cost_delay = 23 );
CREATE TABLE
# select reltoastrelid from pg_class where oid = 'reloptions_test'::regclass;
 reltoastrelid 
---------------
             0
(1 row)

There is no toast relation and the reloption is not saved at all, yet Postgres reports that everything is OK.

Same for alter table:
# ALTER TABLE reloptions_test SET (toast.autovacuum_vacuum_cost_delay = 24 );
ALTER TABLE
# select reltoastrelid from pg_class where oid = 'reloptions_test'::regclass;
 reltoastrelid 
---------------
             0
(1 row)

This is not nice behavior, is it?
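For contrast, when the table does have a variable-length column (like the varchar example at the top), the toast relation exists and the reloption really is stored on it. One way to check (the toast relation's name will differ, since it is derived from the table's oid):

# CREATE TABLE reloptions_test2 (s varchar) WITH (toast.autovacuum_vacuum_cost_delay = 23 );
CREATE TABLE
# SELECT t.relname, t.reloptions
    FROM pg_class c
    JOIN pg_class t ON t.oid = c.reltoastrelid
   WHERE c.oid = 'reloptions_test2'::regclass;

Here the query returns the pg_toast_NNNNN relation with a reloptions array containing autovacuum_vacuum_cost_delay=23.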

PS please when writing a comment, login with any account you have, or just leave a name and/or e-mail so I will be able to answer that comment ;-)

Tomas Vondra: Application users vs. Row Level Security


A few days ago I blogged about the common issues with roles and privileges we discover during security reviews.

Of course, PostgreSQL offers many advanced security-related features, one of them being Row Level Security (RLS), available since PostgreSQL 9.5.

As 9.5 was released in January 2016 (so just a few months ago), RLS is a fairly new feature and we’re not really dealing with many production deployments yet. Instead, RLS is a common subject of “how to implement” discussions, and one of the most common questions is how to make it work with application-level users. So let’s see what possible solutions there are.

Introduction to RLS

Let’s see a very simple example first, explaining what RLS is about. Let’s say we have a chat table storing messages sent between users – the users can insert rows into it to send messages to other users, and query it to see messages sent to them by other users. So the table might look like this:

CREATE TABLE chat (
    message_uuid    UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    message_time    TIMESTAMP NOT NULL DEFAULT now(),
    message_from    NAME      NOT NULL DEFAULT current_user,
    message_to      NAME      NOT NULL,
    message_subject VARCHAR(64) NOT NULL,
    message_body    TEXT
);

The classic role-based security only allows us to restrict access to either the whole table or vertical slices of it (columns). So we can’t use it to prevent users from reading messages intended for other users, or sending messages with a fake message_from field.

And that’s exactly what RLS is for – it allows you to create rules (policies) restricting access to subsets of rows. So for example you can do this:

CREATE POLICY chat_policy ON chat
    USING ((message_to = current_user) OR (message_from = current_user))
    WITH CHECK (message_from = current_user)

This policy ensures a user can only see messages sent by him or intended for him – that’s what the condition in the USING clause does. The second part of the policy (WITH CHECK) ensures a user can only insert messages with his username in the message_from column, preventing messages with a forged sender.
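One detail the snippet above does not show (but which is required for the policy to have any effect): row level security has to be enabled on the table first.

ALTER TABLE chat ENABLE ROW LEVEL SECURITY;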

You can also imagine RLS as an automatic way to append additional WHERE conditions. You could do that manually at the application level (and before RLS people often did that), but RLS does that in a reliable and safe way (a lot of effort was put into preventing various information leaks, for example).

Note: Before RLS, a popular way to achieve something similar was to make the table inaccessible directly (revoke all the privileges), and provide a set of security definer functions to access it. That achieved mostly the same goal, but functions have various disadvantages – they tend to confuse the optimizer, and seriously limit the flexibility (if the user needs to do something and there’s no suitable function for it, he’s out of luck). And of course, you have to write those functions.

Application users

If you read the official documentation about RLS, you may notice one detail – all the examples use current_user, i.e. the current database user. But that’s not how most database applications work these days. Web applications with many registered users don’t maintain 1:1 mapping to database users, but instead use a single database user to run queries and manage application users on their own – perhaps in a users table.

Technically it’s not a problem to create many database users in PostgreSQL. The database should handle that without any problems, but applications don’t do that for a number of practical reasons. For example they need to track additional information for each user (e.g. department, position within the organization, contact details, …), so the application would need the users table anyway.

Another reason may be connection pooling – using a single shared user account, although we know that’s solvable using inheritance and SET ROLE (see the previous post).

But let’s assume you don’t want to create separate database users – you want to keep using a single shared database account, and use RLS with application users. How to do that?

Session variables

Essentially what we need is to pass additional context to the database session, so that we can later use it from the security policy (instead of the current_user variable). And the easiest way to do that in PostgreSQL are session variables:

SET my.username = 'tomas'

If this resembles the usual configuration parameters (e.g. SET work_mem = '...'), you’re absolutely right – it’s mostly the same thing. The command defines a new namespace (my), and adds a username variable into it. The new namespace is required, as the global one is reserved for the server configuration and we can’t add new variables to it. This allows us to change the security policy like this:

CREATE POLICY chat_policy ON chat
    USING (current_setting('my.username') IN (message_from, message_to))
    WITH CHECK (message_from = current_setting('my.username'))

All we need to do is to make sure the connection pool / application sets the user name whenever it gets a new connection and assigns it to the user task.

Let me point out that this approach collapses once you allow the users to run arbitrary SQL on the connection, or if the user manages to discover a suitable SQL injection vulnerability. In that case there’s nothing that could stop them from setting arbitrary username. But don’t despair, there’s a bunch of solutions to that problem, and we’ll quickly go through them.

Signed session variables

The first solution is a simple improvement of the session variables – we can’t really prevent the users from setting arbitrary value, but what if we could verify that the value was not subverted? That’s fairly easy to do using a simple digital signature. Instead of just storing the username, the trusted part (connection pool, application) can do something like this:

signature = sha256(username + timestamp + SECRET)

and then store both the value and the signature into the session variable:

SET my.username = 'username:timestamp:signature'

Assuming the user does not know the SECRET string (e.g. 128B of random data), it shouldn’t be possible to modify the value without invalidating the signature.

Note: This is not a new idea – it’s essentially the same thing as signed HTTP cookies. Django has a quite nice documentation about that.

The easiest way to protect the SECRET value is by storing it in a table inaccessible by the user, and providing a security definer function, requiring a password (so that the user can’t simply sign arbitrary values).

CREATE FUNCTION set_username(uname TEXT, pwd TEXT) RETURNS text AS $$
DECLARE
    v_key   TEXT;
    v_value TEXT;
BEGIN
    SELECT sign_key INTO v_key FROM secrets;
    v_value := uname || ':' || extract(epoch from now())::int;
    v_value := v_value || ':' || crypt(v_value || ':' || v_key,
                                       gen_salt('bf'));
    PERFORM set_config('my.username', v_value, false);
    RETURN v_value;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER STABLE;

The function simply looks up the signing key (secret) in a table, computes the signature and then sets the value into the session variable. It also returns the value, mostly for convenience.

So the trusted part can do this right before handing the connection to the user (obviously ‘passphrase’ is not a very good password for production):

SELECT set_username('tomas', 'passphrase')

And then of course we need another function that simply verifies the signature and either errors out or returns the username if the signature matches.

CREATE FUNCTION get_username() RETURNS text AS $$
DECLARE
    v_key   TEXT;
    v_parts TEXT[];
    v_uname TEXT;
    v_value TEXT;
    v_timestamp INT;
    v_signature TEXT;
BEGIN

    -- no password verification this time
    SELECT sign_key INTO v_key FROM secrets;

    v_parts := regexp_split_to_array(current_setting('my.username', true), ':');
    v_uname := v_parts[1];
    v_timestamp := v_parts[2];
    v_signature := v_parts[3];

    v_value := v_uname || ':' || v_timestamp || ':' || v_key;
    IF v_signature = crypt(v_value, v_signature) THEN
        RETURN v_uname;
    END IF;

    RAISE EXCEPTION 'invalid username / timestamp';
END;
$$ LANGUAGE plpgsql SECURITY DEFINER STABLE;

And as this function does not need the passphrase, the user can simply do this:

SELECT get_username()

But the get_username() function is meant for security policies, e.g. like this:

CREATE POLICY chat_policy ON chat
    USING (get_username() IN (message_from, message_to))
    WITH CHECK (message_from = get_username())

A more complete example, packed as a simple extension, may be found here.

Notice all the objects (table and functions) are owned by a privileged user, not the user accessing the database. The user only has the EXECUTE privilege on the functions, which are, however, defined as SECURITY DEFINER. That’s what makes this scheme work while protecting the secret from the user. The functions are defined as STABLE, to limit the number of calls to the crypt() function (which is intentionally expensive to prevent brute-forcing).

The example functions definitely need more work. But hopefully it’s good enough for a proof of concept demonstrating how to store additional context in a protected session variable.

What needs to be fixed you ask? Firstly the functions don’t handle various error conditions very nicely. Secondly, while the signed value includes a timestamp, we’re not really doing anything with it – it may be used to expire the value, for example. It’s possible to add additional bits into the value, e.g. a department of the user, or even information about the session (e.g. PID of the backend process to prevent reusing the same value on other connections).

Crypto

The two functions rely on cryptography – we’re not using much except some simple hashing functions, but it’s still a simple crypto scheme. And everyone knows you should not do your own crypto. Which is why I used the pgcrypto extension, particularly the crypt() function, to get around this problem. But I’m not a cryptographer, so while I believe the whole scheme is fine, maybe I’m missing something – let me know if you spot something.

Also, the signing would be a great match for public-key cryptography – we could use a regular PGP key with a passphrase for the signing, and the public part for signature verification. Sadly, although pgcrypto supports PGP for encryption, it does not support signing.

Alternative approaches

Of course, there are various alternative solutions. For example, instead of storing the signing secret in a table, you may hard-code it into the function (but then you need to make sure the user can’t see the source code). Or you may do the signing in a C function, in which case it’s hidden from everyone who does not have access to memory (and if the attacker can read memory, you’ve lost anyway).

Also, if you don’t like the signing approach at all, you may replace the signed variable with a more traditional “vault” solution. We need a way to store the data, but we need to make sure the user can’t see or modify the contents arbitrarily, except in a defined way. But hey, that’s what regular tables with an API implemented using security definer functions can do!

I’m not going to present the whole reworked example here (check this extension for a complete example), but what we need is a sessions table acting as the vault:

CREATE TABLE sessions (
    session_id      UUID PRIMARY KEY,
    "session_user"  NAME NOT NULL
)

The table must not be accessible by regular database users – a simple REVOKE ALL FROM ... should take care of that. And then an API consisting of two main functions:

  • set_username(user_name, passphrase) – generates a random UUID, inserts data into the vault and stores the UUID into a session variable
  • get_username() – reads the UUID from a session variable and looks up the row in the table (errors if no matching row)

This approach replaces the signature protection with randomness of the UUID – the user may tweak the session variable, but the probability of hitting an existing ID is negligible (UUIDs are 128-bit random values).
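A minimal sketch of the two functions described above, under those assumptions (this is my illustration, not the extension's actual code; the passphrase check and error handling are omitted, and gen_random_uuid() is assumed to come from pgcrypto):

CREATE FUNCTION set_username(user_name TEXT, passphrase TEXT) RETURNS uuid AS $$
DECLARE
    v_id UUID := gen_random_uuid();
BEGIN
    -- a real implementation would verify the passphrase against a protected table here
    INSERT INTO sessions (session_id, "session_user") VALUES (v_id, user_name);
    PERFORM set_config('my.session_id', v_id::text, false);
    RETURN v_id;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

CREATE FUNCTION get_username() RETURNS text AS $$
    SELECT "session_user"::text FROM sessions
     WHERE session_id = current_setting('my.session_id')::uuid;
$$ LANGUAGE sql SECURITY DEFINER STABLE;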

It’s a bit more traditional approach, relying on traditional role-based security, but it also has a few disadvantages – for example it actually does database writes, which means it’s inherently incompatible with hot standby systems.

Getting rid of the passphrase

It’s also possible to design the vault so that the passphrase is not necessary. We have introduced it because we assumed set_username happens on the same connection – we have to keep the function executable (so messing with roles or privileges is not a solution), and the passphrase ensures only the trusted component can actually use it.

But what if the signing / session creation happens on a separate connection, and only the result (signed value or session UUID) is copied into the connection handed to the user? Well, then we don’t need the passphrase any more. (It’s a bit similar to what Kerberos does – generating a ticket on a trusted connection, then use the ticket for other services.)

Summary

So let me quickly recap this blog post:

  • While all the RLS examples use database users (by means of current_user), it’s not very difficult to make RLS work with application users.
  • Session variables are a reliable and quite simple solution, assuming the system has a trusted component that can set the variable before handing the connection to a user.
  • When the user can execute arbitrary SQL (either by design or thanks to a vulnerability), a signed variable prevents the user from changing the value.
  • Other solutions are possible, e.g. replacing the session variables with a table storing info about sessions identified by a random UUID.
  • A nice thing is that session variables require no database writes, so this approach can work on read-only systems (e.g. hot standby).

In the next part of this blog series we’ll look at using application users when the system does not have a trusted component (so it can’t set the session variable or create a row in the sessions table), or when we want to perform (additional) custom authentication within the database.


Leo Hsu and Regina Obe: PostgreSQL 9.6 phrase text searching how far apart can you go

Michael Paquier: Postgres 9.6 feature highlight: Non-exclusive base backups


pg_start_backup and pg_stop_backup, the two low-level functions of PostgreSQL that can be used to take a base backup from an instance, have been extended with a new option allowing them to take what are called non-exclusive backups. This feature is introduced in PostgreSQL 9.6 by the following commit:

commit: 7117685461af50f50c03f43e6a622284c8d54694
author: Magnus Hagander <magnus@hagander.net>
date: Tue, 5 Apr 2016 20:03:49 +0200
Implement backup API functions for non-exclusive backups

Previously non-exclusive backups had to be done using the replication protocol
and pg_basebackup. With this commit it's now possible to make them using
pg_start_backup/pg_stop_backup as well, as long as the backup program can
maintain a persistent connection to the database.

Doing this, backup_label and tablespace_map are returned as results from
pg_stop_backup() instead of being written to the data directory. This makes
the server safe from a crash during an ongoing backup, which can be a problem
with exclusive backups.

The old syntax of the functions remain and work exactly as before, but since the
new syntax is safer this should eventually be deprecated and removed.

Only reference documentation is included. The main section on backup still needs
to be rewritten to cover this, but since that is already scheduled for a separate
large rewrite, it's not included in this patch.

Reviewed by David Steele and Amit Kapila

The existing functions pg_start_backup and pg_stop_backup, which have been present in Postgres for ages, have a couple of limitations that have always been disturbing for some users:

  • It is not possible to take multiple backups in parallel.
  • In case of a crash of the tool taking the backup, the server remains stuck in backup mode and needs some cleanup actions.
  • Because the backup_label file is created in the data folder, it is not possible to tell the difference between a server that crashed while a backup was being taken and a cluster restored from a backup.

Some users are able to live with those problems: the application layer in charge of the backups can take extra cleanup actions in case a backup tool crash leaves the cluster in an inconsistent state, or it has a design that assumes that no more than one backup can be taken.

Non-exclusive backups work in such a way that the backup_label file and the tablespace map file are not created in the data folder but are returned as results of pg_stop_backup. In this case the backup tool is the one in charge of writing both files into the backup taken. This has the advantage of removing the problems that exclusive backups induce, at the cost of a couple of things though:

  • The backup utility is in charge of doing some extra work to put the resulting base backup in a consistent state.
  • The connection to the backend needs to remain while the base backup is being taken. If the client disconnects while the backup is taken, it is aborted.

So, in order to control that, a third argument has been added to pg_start_backup. Its default value is true, meaning that an exclusive backup is taken, protecting all the existing backup tools:

=# SELECT pg_start_backup('my_backup', true, false);
 pg_start_backup
-----------------
 0/4000028
(1 row)

Note also that pg_stop_backup now uses an extra argument to track whether it needs to stop an exclusive or a non-exclusive backup. With the backup started previously, trying to stop an exclusive backup results in an error:

=# SELECT pg_stop_backup(true);
ERROR:  55000: non-exclusive backup in progress
HINT:  did you mean to use pg_stop_backup('f')?
LOCATION:  pg_stop_backup_v2, xlogfuncs.c:230

Then let’s stop it correctly; the resulting fields are what is needed to complete the backup:

=# SELECT * FROM pg_stop_backup(false);
NOTICE:  00000: pg_stop_backup complete, all required WAL segments have been archived
LOCATION:  do_pg_stop_backup, xlog.c:10569
    lsn    |                           labelfile                           | spcmapfile
-----------+---------------------------------------------------------------+------------
 0/4000130 | START WAL LOCATION: 0/4000028 (file 000000010000000000000004)+|
           | CHECKPOINT LOCATION: 0/4000060                               +|
           | BACKUP METHOD: streamed                                      +|
           | BACKUP FROM: master                                          +|
           | START TIME: 2016-05-31 10:34:46 JST                          +|
           | LABEL: my_backup                                             +|
           |                                                               |
(1 row)

The contents of “labelfile” need to be written as backup_label in the backup taken, while the contents of “spcmapfile” need to be written to tablespace_map. Once the contents of those files are written, don’t forget to flush them to disk as well, to prevent any potential loss caused by power failures for example.

Andrew Dunstan: Indiscriminate use of CTEs considered harmful

Common Table Expressions are a wonderful thing. Not only are they indispensable for creating recursive queries, but they can be a powerful tool in creating complex queries that are comprehensible. It's very easy to get lost in a fog of sub-sub-sub-queries, so using CTEs as a building block can make things a lot nicer.

However, there is one aspect of the current implementation of CTEs that should make you pause. Currently CTEs are in effect materialized before they can be used. That is, Postgres runs the query and stashes the data in a temporary store before it can be used in the larger query. There are a number of consequences of this.

First, this can be a good thing. I have on a number of occasions used this fact to good effect to get around problems with poorly performing query plans. It's more or less the same effect as putting "offset 0" on a subquery.

However, it can also result in some very inefficient plans. In particular, if CTEs return a lot of rows they can result in some spectacularly poorly performing plans at the point where you come to use them. Note also that any indexes on the underlying tables will be of no help to you at all at this point, since you are no longer querying against those tables but against the intermediate result mentioned above, which has no indexes at all.
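
To make the effect concrete, here is a hypothetical sketch (table, column, and index names are invented): with the current implementation the first query materializes the CTE and then scans that temporary result, while the second, inlined form lets the planner push the filter down and use an index on the underlying table.

-- Materialized: the filter on customer_id is applied to the temporary result,
-- and no index on orders can help at that point.
WITH recent_orders AS (
    SELECT * FROM orders WHERE created_at > now() - interval '30 days'
)
SELECT * FROM recent_orders WHERE customer_id = 42;

-- Inlined equivalent: the planner can combine both conditions and use
-- an index on orders (customer_id, created_at).
SELECT *
  FROM (SELECT * FROM orders
         WHERE created_at > now() - interval '30 days') AS recent_orders
 WHERE customer_id = 42;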

This was brought home to me forcefully on Friday and Saturday when I was looking at a very poorly performing query. After some analysis and testing, the simple act of inlining two CTEs in the query in question resulted in the query running in 4% of the time it had previously taken. Indiscriminate use of CTEs had made the performance of this query 25 times worse.

So the moral is: be careful in using CTEs. They are not just a convenient tool for abstracting away subqueries.

There has been some discussion about removing this aspect of the implementation of CTEs. It's not something that is inherent in CTEs, it's simply a part of the way we have implemented them in PostgreSQL. However, for now, you need to be familiar with the optimization effects when using them, or you might face the same problem I was dealing with above.


Bruce Momjian: Lots-O-Travel


Since January, I have had the pleasure of speaking about Postgres in 15 cities: Singapore, Seoul, Tokyo, San Francisco, Los Angeles, Phoenix, St. Louis, Bloomington (Illinois), Chicago, Charlotte, New York City, Brussels, Helsinki, Moscow, and Krasnoyarsk (Siberia).

I am particularly excited about the new cities I visited in Asia, and growth there will continue in the coming months. You can see from my travel map that the two areas still lacking Postgres activity are the Middle East and Africa. Fortunately, Umair Shahid has already started on the Middle East.

In more Postgres-saturated continents, like North America and Europe, there are now several conferences per year during different months, in different cities, and with different focuses. While proprietary database companies usually have just one huge conference a year per continent, our distributed conference teams allow for smaller, more frequent, more geographically distributed conferences, which better meet the needs of our users. Smaller conferences allow for more interaction with speakers and leaders. More frequent conferences allow people to attend a conference quickly, rather than waiting eleven months for the next yearly conference. Geographically distributed conferences allow for reduced travel costs, which is particularly important for first-time attendees. I realize seeing thousands of Postgres people together is motivating, but once that wears off, the benefits of more, smaller conferences are hard to beat.

Continue Reading »

REGINA OBE: FOSS4GNA 2016 PostGIS Spatial Tricks video is out


The videos for FOSS4G NA 2016 have started coming out. Recently Andrea Ross posted PostGIS Spatial Tricks talk video. I'm happy to say it looks pretty good and I didn't suck as badly as I worried I would. Thank you very much Andrea. Some talks unfortunately did not come thru. I'm hoping Leo's pgRouting : a Crash Course video made it thru okay as well, and will post that later if it does.

The only small nit-picks are that the first 2-5 minutes or so didn't make it thru and the blue colors on the slides got a little drowned out, but here are the slides if you need full resolution.

Greg Sabino Mullane: Bucardo replication workarounds for extremely large Postgres updates


Bucardo is very good at replicating data among Postgres databases (as well as replicating to other things, such as MariaDB, Oracle, and Redis!). However, sometimes you need to work outside the normal flow of trigger-based replication systems such as Bucardo. One such scenario is when a lot of changes need to be made to your replicated tables. And by a lot, I mean many millions of rows. When this happens, it may be faster and easier to find an alternate way to replicate those changes.

When a change is made to a table that is being replicated by Bucardo, a trigger fires and stores the primary key of the row that was changed into a "delta" table. Then the Bucardo daemon comes along, gathers a list of all rows that were changed since the last time it checked, and pushes those rows to the other databases in the sync (a named replication set). Although all of this is done in a fast and efficient manner, there is a bit of overhead that adds up when (for example), updating 650 million rows in one transaction.

The first and best solution is to simply hand-apply all the changes yourself to every database you are replicating to. By disabling the Bucardo triggers first, you can prevent Bucardo from even knowing, or caring, that the changes have been made.

To demonstrate this, let's have Bucardo replicate among five pgbench databases, called A, B, C, D, and E. Databases A, B, and C will be sources; D and E are just targets. Our replication looks like this: ( A <=> B <=> C ) => (D, E). First, we create all the databases and populate them:

## Create a new cluster for this test, and use port 5950 to minimize impact
$ initdb --data-checksums btest
$ echo port=5950 >> btest/postgresql.conf
$ pg_ctl start -D btest -l logfile

## Create the main database and install the pg_bench schema into it
$ export PGPORT=5950
$ createdb alpha
$ pgbench alpha -i --foreign-keys

## Replicated tables need a primary key, so we need to modify things a little:
$ psql alpha -c 'alter table pgbench_history add column hid serial primary key'

## Create the other four databases as exact copies of the first one:
$ for dbname in beta gamma delta epsilon; do createdb $dbname -T alpha; done

Now that those are done, let's install Bucardo, teach it about these databases, and create a sync to replicate among them as described above.

$ bucardo install --batch
$ bucardo add dbs A,B,C,D,E dbname=alpha,beta,gamma,delta,epsilon dbport=5950
$ bucardo add sync fiveway tables=all dbs=A:source,B:source,C:source,D:target,E:target

## Tweak a few default locations to make our tests easier:
$ echo -e "logdest=.\npiddir=." > .bucardorc

At this point, we have five databases all ready to go, and Bucardo is setup to replicate among them. Let's do a quick test to make sure everything is working as it should.

$ bucardo start
Checking for existing processes
Starting Bucardo

$ for db in alpha beta gamma delta epsilon; do psql $db -Atc "select '$db',sum(abalance) from pgbench_accounts";done | tr "\n" " "
alpha|0 beta|0 gamma|0 delta|0 epsilon|0

$ pgbench alpha
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
latency average: 0.000 ms
tps = 60.847066 (including connections establishing)
tps = 62.877481 (excluding connections establishing)

$ for db in alpha beta gamma delta epsilon; do psql $db -Atc "select '$db',sum(abalance) from pgbench_accounts";done | tr "\n" " "
alpha|6576 beta|6576 gamma|6576 delta|6576 epsilon|6576

$ pgbench beta
starting vacuum...end.
...
tps = 60.728681 (including connections establishing)
tps = 62.689074 (excluding connections establishing)

$ for db in alpha beta gamma delta epsilon; do psql $db -Atc "select '$db',sum(abalance) from pgbench_accounts";done | tr "\n" " "
alpha|7065 beta|7065 gamma|7065 delta|7065 epsilon|7065

Let's imagine that the bank discovered a huge financial error, and needed to increase the balance of every account created in the last two years by 25 dollars. Let's further imagine that this involved 650 million customers. That UPDATE will take a very long time, but will suffer even more because each update will also fire a Bucardo trigger, which in turn will write to another "delta" table. Then, Bucardo will have to read in 650 million rows from the delta table, and (on every other database in the sync) apply those changes by deleting 650 million rows then COPYing over the correct values. This is one situation where you want to sidestep your replication and handle things yourself. There are three solutions to this. The easiest, as mentioned, is to simply do all the changes yourself and prevent Bucardo from worrying about it.

The basic plan is to apply the updates on all the databases in the syncs at once, while using the session_replication_role feature to prevent the triggers from firing. Of course, this will prevent *all* of the triggers on the table from firing. If there are some non-Bucardo triggers that must fire during this update, you might wish to temporarily set them as ALWAYS triggers.
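
For example (the trigger name below is hypothetical), a non-Bucardo trigger can be marked as ALWAYS so that it keeps firing while the session is in replica mode, and switched back afterwards:

-- Hypothetical audit trigger that must keep firing during the manual update:
ALTER TABLE pgbench_accounts ENABLE ALWAYS TRIGGER accounts_audit_trg;

-- Bucardo's own triggers stay silent in sessions running with:
SET session_replication_role = 'replica';

-- Once the mass update is done, restore the normal firing mode:
ALTER TABLE pgbench_accounts ENABLE TRIGGER accounts_audit_trg;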

Solution one: manual copy

## First, stop Bucardo. Although not necessary, the databases are going to be busy enough
## that we don't need to worry about Bucardo at the moment.
$ bucardo stop
Creating ./fullstopbucardo ... Done

## In real-life, this query should get run in parallel across all databases,
## which would be on different servers:
$ QUERY='UPDATE pgbench_accounts SET abalance = abalance + 25 WHERE aid > 78657769;'

$ for db in alpha beta gamma delta epsilon; do psql $db -Atc "SET session_replication_role='replica'; $QUERY"; done | tr "\n" " "
UPDATE 83848570 UPDATE 83848570 UPDATE 83848570 UPDATE 83848570 UPDATE 83848570

## For good measure, confirm Bucardo did not try to replicate all those rows:
$ bucardo kick fiveway
Kicked sync fiveway

$ grep Totals log.bucardo
(11144) [Mon May 16 23:08:57 2016] KID (fiveway) Totals: deletes=36 inserts=28 conflicts=0
(11144) [Mon May 16 23:09:02 2016] KID (fiveway) Totals: deletes=38 inserts=29 conflicts=0
(11144) [Mon May 16 23:09:22 2016] KID (fiveway) Totals: deletes=34 inserts=27 conflicts=0
(11144) [Tue May 16 23:15:08 2016] KID (fiveway) Totals: deletes=10 inserts=7 conflicts=0
(11144) [Tue May 16 23:59:00 2016] KID (fiveway) Totals: deletes=126 inserts=73 conflicts=0

Solution two: truncate the delta

As a second solution, what about the event involving a junior DBA who made all those updates on one of the source databases without disabling triggers? When this happens, you would probably find that your databases are all backed up and waiting for Bucardo to handle the giant replication job. If the rows that have changed constitute most of the total rows in the table, your best bet is to simply copy the entire table. You will also need to stop the Bucardo daemon, and prevent it from trying to replicate those rows when it starts up by cleaning out the delta table. As a first step, stop the main Bucardo daemon, and then forcibly stop any active Bucardo processes:

$ bucardo stop
Creating ./fullstopbucardo ... Done

$ pkill -15 Bucardo

Now to clean out the delta table. In this example, the junior DBA updated the "beta" database, so we look there. We may go ahead and truncate it because we are going to copy the entire table after that point.

# The delta tables follow a simple format. Make sure it is the correct one
$ psql beta -Atc 'select count(*) from bucardo.delta_public_pgbench_accounts'
650000000
## Yes, this must be the one!

## Truncates are dangerous; be extra careful from this point forward
$ psql beta -Atc 'truncate table bucardo.delta_public_pgbench_accounts'

The delta table will continue to accumulate changes as applications update the table, but that is okay - we got rid of the 650 million rows. Now we know that beta has the canonical information, and we need to get it to all the others. As before, we use session_replication_role. However, we also need to ensure that nobody else will try to add rows before our COPY gets in there, so if you have active source databases, pause your applications. Or simply shut them out for a while via pg_hba.conf! Once that is done, we can copy the data until all databases are identical to "beta":

$ ( echo "SET session_replication_role='replica'; TRUNCATE TABLE pgbench_accounts; " ; pg_dump beta --section=data -t pgbench_accounts ) | psql alpha -1 --set ON_ERROR_STOP=on
SET
ERROR:  cannot truncate a table referenced in a foreign key constraint
DETAIL:  Table "pgbench_history" references "pgbench_accounts".
HINT:  Truncate table "pgbench_history" at the same time, or use TRUNCATE ... CASCADE.

Aha! Note that we used the --foreign-keys option when creating the pgbench tables above. We will need to remove the foreign key, or simply copy both tables together. Let's do the latter:

$ ( echo "SET session_replication_role='replica'; TRUNCATE TABLE pgbench_accounts, pgbench_history; " ; pg_dump beta --section=data -t pgbench_accounts \
  -t pgbench_history) | psql alpha -1 --set ON_ERROR_STOP=on
SET
TRUNCATE TABLE
SET
SET
SET
SET
SET
SET
SET
SET
COPY 100000
COPY 10
 setval 
--------
     30
(1 row)

## Do the same for the other databases:
$ for db in gamma delta epsilon; do \
 ( echo "SET session_replication_role='replica'; TRUNCATE TABLE pgbench_accounts, pgbench_history; " ; pg_dump beta --section=data -t pgbench_accounts \
  -t pgbench_history) | psql $db -1 --set ON_ERROR_STOP=on ; done

Note: if your tables have a lot of constraints or indexes, you may want to disable those to speed up the COPY. Or even turn fsync off. But that's the topic of another post.

Solution three: delta excision

Our final solution is a variant on the last one. As before, the junior DBA has done a mass update of one of the databases involved in the Bucardo sync. But this time, you decide it should be easier to simply remove the deltas and apply the changes manually. As before, we shut down Bucardo. Then we determine the timestamp of the mass change by checking the delta table closely:

$ psql beta -Atc 'select txntime, count(*) from bucardo.delta_public_pgbench_accounts group by 1 order by 2 desc limit 4'
2016-05-26 23:23:27.252352-04|65826965
2016-05-26 23:23:22.460731-04|80
2016-05-07 23:20:46.325105-04|73
2016-05-26 23:23:33.501002-04|69

Now we want to carefully excise those deltas. With that many rows, it is quicker to save/truncate/copy than to do a delete:

$ psql beta
beta=# BEGIN;
BEGIN

## To prevent anyone from firing the triggers that write to our delta table
beta=# LOCK TABLE pgbench_accounts;
LOCK TABLE

## Copy all the delta rows we want to save:
beta=# CREATE TEMP TABLE bucardo_store_deltas AS SELECT * FROM bucardo.delta_public_pgbench_accounts WHERE txntime <> '2016-05-26 23:23:27.252352-04';
SELECT 1885
beta=# TRUNCATE TABLE bucardo.delta_public_pgbench_accounts;
TRUNCATE TABLE

## Repopulate the delta table with our saved edits
beta=# INSERT INTO bucardo.delta_public_pgbench_accounts SELECT * FROM bucardo_store_deltas;
INSERT 0 1885

## This will remove the temp table
beta=# COMMIT;
COMMIT

Now that the deltas are removed, we want to emulate what caused them on all the other servers. Note that this query is a contrived one that may lend itself to concurrency issues. If you go this route, make sure your query will produce the exact same results on all the servers.

## As in the first solution above, this should ideally run in parallel
$ QUERY='UPDATE pgbench_accounts SET abalance = abalance + 25 WHERE aid > 78657769;'

## Unlike before, we do NOT run this against beta
$ for db in alpha gamma delta epsilon; do psql $db -Atc "SET session_replication_role='replica'; $QUERY"; done | tr "\n" " "
UPDATE 837265 UPDATE 837265 UPDATE 837265 UPDATE 837265

## Now we can start Bucardo up again
$ bucardo start
Checking for existing processes
Starting Bucardo

That concludes the solutions for when you have to make a LOT of changes to your database. How do you know how much is enough to worry about the solutions presented here? Generally, you can simply let Bucardo run - you will know when everything crawls to a halt that perhaps trying to insert 465 million rows at once was a bad idea. :)

Stefan Petrea: Redesigning a notification system


Intro

As previously mentioned, UpStatsBot is part of UpStats. It's a notification system in the form of a bot. You can get notifications about jobs while you're on the street or in a cafe.

This bot has been running for ~7 months now, and it has been running quite well except for one thing: the Postgres logs were growing a lot. Not long ago, I analyzed one week of PostgreSQL logs using pgbadger. The results were startling:

As the image shows, the bot was responsible for more than 40% of all queries to that database. In addition, the Pg logs were growing quite a lot.

The bot would run queries to poll for new items every 10 minutes; those queries would run regardless of whether the collector had brought in new data since the last time they were run.

This blog post will describe how I fixed that issue using PostgreSQL's LISTEN/NOTIFY feature 1, 2.

The main advantage of LISTEN/NOTIFY is the ability to receive near-realtime notifications with a fraction of the queries.
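
At the SQL level the mechanism itself is tiny; here is a minimal illustration in psql (the channel name and payload are arbitrary):

-- session 1: subscribe to a channel
LISTEN botchan;

-- session 2: publish a message with an optional payload
NOTIFY botchan, '{"data": "new jobs available"}';

-- session 1 will then print something like:
-- Asynchronous notification "botchan" with payload "{"data": "new jobs available"}"
-- received from server process with PID <pid>.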

Overview of the UpStats project

The diagram above summarizes how UpStats and UpStatsBot work and how the system is composed:

  • A data collector
  • A PostgreSQL db
  • A web app
  • Metrics (that are recomputed periodically)
  • A notification system

A user can access it from the Telegram bot or from the web app. The data is collected from the UpWork API and placed in a PostgreSQL database. Metrics can be computed (via more complex SQL queries) and notifications dispatched using the Telegram API.

We aim to move the logic responsible for notification computation from the bot into the collector, in order to achieve near real-time dispatch (i.e. whenever new data becomes available).

Tables involved in notifications

The relevant tables here are:

  • odesk_job
  • odesk_search_job
  • odesk_telegram

In summary, we use search keywords from odesk_telegram and search for them in odesk_job via odesk_search_job. The odesk_search_job table holds full-text indexes for the jobs. The odesk_telegram table holds search keywords and active search streams for each subscribed user.

PostgreSQL's LISTEN/NOTIFY in Python

To offer some background, some use-cases of LISTEN/NOTIFY include:

  • using it in conjunction with websockets to build chat systems 3
  • building asynchronous and trigger-based replication systems 4
  • keeping caches in sync with a PostgreSQL database 5

We're using the psycopg2 connector 6 for PostgreSQL. The connector uses a socket to talk to the Postgres database server, and that socket has a file descriptor. That descriptor is used in the select call. Select checks if the descriptor is ready for reading 7 .

In order to exemplify this, we'll write a simple Python class that allows us to send and listen to notifications 8 .

import select
import psycopg2
import psycopg2.extensions
from psycopg2.extensions import QuotedString
import json
import time

__doc__ = """This class is used to create an easy to use queue
mechanism where you can send and listen to messages."""


class PgQueue:
    dbuser = None
    dbpass = None
    dbname = None
    conn = None
    curs = None
    channel = None
    continue_recv = True

    def __init__(self, channel, dbname=None, dbuser=None, dbpass=None):
        """
        Connect to the database.

        If one of dbname, dbuser or dbpass is not provided, the
        responsibility of providing (and setting a connection on this
        object) will fall on the calling code.

        Otherwise, this will create a connection to the database.
        """
        self.dbname = dbname
        self.dbuser = dbuser
        self.dbpass = dbpass
        self.channel = channel

        if not channel:
            raise Exception('No channel provided')

        if dbname and dbuser and dbpass:
            # store connection
            self.conn = psycopg2.connect(
                'dbname={dbname} user={dbuser} password={dbpass} host=127.0.0.1'.format(
                    dbname=dbname, dbuser=dbuser, dbpass=dbpass))
            # this is required mostly by the NOTIFY statement because it has
            # to commit after the query has been executed
            self.conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

    def recvLoop(self):
        """Loop that's concerned with receiving notifications"""
        self.curs = self.conn.cursor()
        self.curs.execute("LISTEN {0};".format(self.channel))

        conn = self.conn
        curs = self.curs

        while self.continue_recv:
            # wait up to 6 seconds for the connection's socket to become readable
            if select.select([conn], [], [], 6) == ([], [], []):
                print "consumer: timeout"
            else:
                conn.poll()
                print "consumer: received messages"
                while conn.notifies:
                    notif = conn.notifies.pop(0)
                    # print "Got NOTIFY:", notif.pid, notif.channel, notif.payload
                    self.recvCallback(notif)

    def recvCallback(self, notification):
        """Needs to be implemented with notification handling logic"""
        pass

    def send(self, data):
        """Send a notification"""
        curs = self.conn.cursor()

        message = {}
        print "producer: sending.."
        # equip the message object with a timestamp
        message['time'] = time.time()
        message['data'] = data
        messageJson = json.dumps(message)
        messagePg = QuotedString(messageJson).getquoted()

        query = 'NOTIFY {0}, {1};'.format(self.channel, messagePg)
        print query
        curs.execute(query)

Now that we've implemented the class we can use it. The producer will be quite simple, and the consumer will need to either patch the recvCallback method or subclass the PgQueue class to override that method. We'll use the former and patch the method. We'll run the producer in one thread and the consumer in another.

def sample_producer_thread():
    q = PgQueue('botchan', dbname='dbname', dbuser='username', dbpass='password')

    while True:
        time.sleep(0.4)
        message = {}
        message['test'] = "value"
        q.send(message)


def sample_consumer_thread():
    q = PgQueue('botchan', dbname='dbname', dbuser='username', dbpass='password')

    def newCallback(m):
        if m.payload:
            payload = m.payload
            print "callback: ", payload

    # replace the receiver callback
    q.recvCallback = newCallback
    q.recvLoop()


if __name__ == '__main__':
    import signal
    from threading import Thread

    thread_producer = Thread(target=sample_producer_thread)
    thread_consumer = Thread(target=sample_consumer_thread)
    thread_producer.start()
    thread_consumer.start()
    thread_producer.join()
    thread_consumer.join()

Putting together user notifications

The regexes below create tsquery-compatible strings. Those strings are then used to run full-text searches on the job table. This way we can build notifications for each user and for each of their active search streams.

The last_job_ts is used to make sure we limit our searches to the new data.

We make use of wCTE (WITH common table expressions) because they're easy to work with and allow for gradually refining results of previous queries until the desired data can be extracted.

Near the end of the query we neatly pack all the data using PostgreSQL's JSON functions.

WITH user_notifs AS (
    SELECT
    id,
    last_job_ts,
    search,
    chat_id,
    regexp_replace(
           LOWER(
               regexp_replace(
                   rtrim(ltrim(search,' '),' '),
                   '\s+',' ','g'
               )
           ),
        '(\s*,\s*|\s)' , ' & ', 'g'
    )
    AS fts_query
    FROM odesk_telegram
    WHERE paused = false AND deleted = false
), jobs AS (
    SELECT A.job_id, A.tsv_basic, B.job_title, B.date_created
    FROM odesk_search_job A
    JOIN odesk_job B ON A.job_id = B.job_id
    WHERE B.date_created > EXTRACT(epoch FROM (NOW() - INTERVAL '6 HOURS'))::int
), new AS (
    SELECT
    A.id, A.chat_id, A.search, B.job_id, B.job_title, B.date_created
    FROM user_notifs AS A
    JOIN jobs B ON (
        B.tsv_basic @@ A.fts_query::tsquery AND
        B.date_created > A.last_job_ts
    )
), json_packed AS (
    SELECT
    A.id,
    A.search,
    A.chat_id,
    json_agg(
    json_build_object (
        'job_id', A.job_id,
        'job_title', A.job_title,
        'date_created', A.date_created
    )) AS j
    FROM new A
    GROUPBY A.id, A.search, A.chat_id
)
SELECT * FROM json_packed;

Tightening the constraints

Near the end of the collector program we compute the notifications; an expensive search query needs to be run in order to find out what to send and whom to send it to.

However, every time this query runs, we can store the latest timestamp for each search stream, so that the next time we compute notifications we can tighten the search constraints and only look at jobs after that timestamp.

In order to do this, the search streams' last_job_ts needs to be updated:

UPDATE odesk_telegram SET last_job_ts = %(new_ts)d WHERE id = %(id)d;

For the active search streams that had new jobs, the earliest timestamp can be passed as a parameter to this query.

Even for the active search streams that have seen no new jobs, we still have to tighten the search by updating their last_job_ts to the time when the collector started (we can only go so far, any later than this and we might miss jobs that were posted while the collector was running).
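
A sketch of that catch-up update, in the same parameter style as the query above (the collector's start time is assumed to be passed in as a parameter):

UPDATE odesk_telegram
   SET last_job_ts = %(collector_start_ts)d
 WHERE paused = false AND deleted = false
   AND last_job_ts < %(collector_start_ts)d;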

Optimizing the search streams

If enough data and enough users are present, we could craft a better query for this. For example, the search keywords could be organized in a tree structure.

This particular tree would store keywords based on the number of search results they're present in; in other words, the more queries a keyword exists in, the closer to the root that keyword will be.

A search stream corresponds to a path in this tree.

For example, in the tree below, php, wordpress is a search stream and user3 has registered for that particular search stream. Accordingly, user3 will receive job ads that match the words php and wordpress. Given the logic described above, php will match more jobs than wordpress.9

A temporary table can be created for the high-volume top-level search streams. To get to the more specific search streams, a JOIN on this table followed by the conditions for the lower-volume keywords would be enough.

For example, there are two search streams php,wordpress (for user3) and php,mysql (for user2). We could cache the ids of the notifications for the larger stream php and then refine it in order to get the two streams we're interested in.
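
As a hypothetical sketch (this is just the idea, not implemented code), the cached superset could be a temporary table that the narrower streams then refine, reusing the tables from the notification query above:

-- Cache the jobs matched by the high-volume keyword once:
CREATE TEMP TABLE stream_php AS
SELECT A.job_id, A.tsv_basic, B.job_title, B.date_created
  FROM odesk_search_job A
  JOIN odesk_job B ON A.job_id = B.job_id
 WHERE A.tsv_basic @@ 'php'::tsquery;

-- Refine the cached superset for the narrower streams:
SELECT job_id FROM stream_php WHERE tsv_basic @@ 'wordpress'::tsquery;  -- php, wordpress (user3)
SELECT job_id FROM stream_php WHERE tsv_basic @@ 'mysql'::tsquery;      -- php, mysql (user2)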

This would be particularly interesting for a situation with a large number of subscribed users and a lot of active search streams as the tree would expand, preferably in depth rather than breadth.

Conclusion

This blog post describes the redesign of a notification system and some ideas about improving its performance.

Footnotes:

1

Notifications in PostgreSQL have been around for a long time, but starting with version 9.0 they are equipped with a payload.

2

The message passing described in this blog post is a simple one.

There are situations in which message delivery is more critical than the method described here.

For that purpose, this article provides a better approach that handles delivery failures. It uses a queue table and two triggers. The queue table is used to persist the messages that are sent.

One trigger will be placed on the table whose changes are of interest.

The other trigger will be placed on the queue table. So the message will be the "cascading" result of modifying the actual table. The advantage here is persistence (among other advantages that you can read in that article).

What's more, for the first trigger, the function row_to_json offers a way to serialize the changes in a structure-agnostic way.

Here's another article describing a queue-table-centric approach (without any LISTEN/NOTIFY). There's an emphasis put on locking and updating the 'processed' state of each item in the queue table, and different approaches for that. This would be more in line with a distributed queue.

3

For example this presentation in which the author explains how Postgres sends notifications to a Python backend via LISTEN/NOTIFY, which are then forwarded to the browser via websockets. The presentation is also available on youtube here.

5

This article describes a scenario where a PostgreSQL database updates a cache by broadcasting changes to it.

6

Although the code here is in Python, you may certainly use PostgreSQL's notifications in other languages; it's a well-supported feature.

7

More details about the conn.poll() statement. The poll() method comes from psycopg2 (it's called conn_poll in the C code) and it reads all the notifications from PQsocket (the ones we have issued a LISTEN statement for). There are 5 functions involved:

  • conn_poll (which in turn calls _conn_poll_query)
  • _conn_poll_query (in turn, calls pq_is_busy)
  • pq_is_busy (which calls conn_notifies_process and conn_notice_process)
  • conn_notifies_process (reads them from the PQsocket and populates C data structures)
  • conn_notice_process (turns the available notifications into Python data structures)
8

Need to keep in mind that the payloads are limited to 8000 bytes.

9

This section about optimizing and caching search results is just an idea at this point. Details such as how to keep the tree updated, which search streams should be cached in the temporary table, and how to represent it are not yet worked out. It's not yet implemented, it will probably be used later on.


Simon Riggs: PgDay France 2016


31 May 2016 – http://pgday.fr/

PgDay France was a well attended conference with more than 140 attendees, with many presentations from PostgreSQL users further demonstrating just how popular PostgreSQL is now.

It was good to see a large well organized conference in Lille, the “capital of Northern France”. Lille is an industrial hub and university town, with an airport to give access to rest of Europe. Fast rail links with London, as well as Brussels and Paris, allowed me to attend easily. I’d not been there before, but I’ll be making sure to go back for a better visit.

As we should expect, the conference was in the French language, a challenge for me I admit, but I was able to follow most of the presentations. Spoken slowly, I can sometimes understand.

I arrived just after lunch, to catch the last 5 presentations. The first of those was a talk about transformational change and technical migration from Oracle to PostgreSQL, which seems to be happening everywhere.

Vincent Picavet gave a good talk about PostGIS. Forgive me, but the best part of this talk was where he demonstrated an analogue clock written using PostgreSQL/PostGIS with the hands of the clock as geometry objects. Fantastic!

Next a speaker from Meteo France spoke about their very large meteorological databases. I was happy to hear that he thanked 2ndQuadrant Support, especially Cedric Villemain, who I know had worked hard to make their implementation successful. Interesting talk.

A good technical presentation about BRIN indexes in 9.5 by Adrien Nayrat showed me that 2ndQuadrant’s contributions are beginning to be understood and appreciated across a wide audience. I didn’t have anything to add, though Vincent reminded me over a coffee that the recent additions that allow BRIN indexes to work with PostGIS (by Giuseppe Broccolo) were very important. We spoke a little about future developments in that area.

Last up, Cedric Villemain’s talk about Logical Replication and pglogical raised many questions which a group of us discussed over dinner later that evening. Lots of interest there, but still many misunderstandings we can correct over time.

http://2ndquadrant.com/fr/resources/pglogical/ and http://2ndquadrant.com/en/resources/pglogical/

The conference ended with a pleasant soiree to celebrate the 20th anniversary of the PostgreSQL project. Thanks very much to the French PostgreSQL Association for the invitation to dinner afterwards. Great event, merci beaucoup.

Dinesh Kumar: pgBucket - A new concurrent job scheduler

Hi All,

I'm so excited to announce my first contributed tool for PostgreSQL. I have been working with PostgreSQL since 2011 and I'm really impressed with such a nice database.

I started a few projects in the last 2 years, like pgHawk [a beautiful report generator for Openwatch] and pgOwlt [CUI monitoring; it is still under development, and in case you are interested to see what it is, I'm attaching the image here for you],


and pgBucket [which I'm going to talk about], and learned a lot about PostgreSQL/Linux internals.

Using pgBucket we can schedule jobs easily and we can also maintain them using its CLI options. We can update/insert/delete jobs online. And here is its architecture, which gives you a basic idea of how it works.


Yeah, I know there are other good job schedulers available for PostgreSQL. I haven't tested them and I'm not comparing them with this one, as I implemented it in my own way.

Features are:
  • OS/DB jobs
  • Cron-style syntax
  • Online job modifications
  • Required CLI options

Dependencies:
  • C++11
Here is the link to the source/build instructions, which will hopefully be helpful for you.

Let me know your inputs/suggestions/comments, which will help me to improve this tool.

Thanks as always.

--Dinesh Kumar



Hans-Juergen Schoenig: Watching your PostgreSQL database


Many PostgreSQL users are running their favorite database engine on Linux or some other UNIX system. While Windows is definitely an important factor in the database world, many people like the flexibility of a UNIX-style command line. One feature used by many UNIX people is “watch”. watch runs commands repeatedly, displays their output and errors […]

The post Watching your PostgreSQL database appeared first on Cybertec - The PostgreSQL Database Company.

Shaun M. Thomas: PG Phriday: Rapid Prototyping


Ah, source control. From Subversion to git and everything in between, we all love to manage our code. The ability to quickly branch from an existing base is incredibly important to exploring and potentially abandoning divergent code paths. One often overlooked Postgres feature is the template database. At first glance, it’s just a way to ensure newly created databases contain some base functionality without having to bootstrap every time, but it’s so much more than that.

Our first clue to the true potential of template databases should probably start with the template0 and template1 databases. Every Postgres instance has them, and unless we’re familiar with the internals or intended use, they’re easy to ignore. They’re empty after all, right? Not quite. Individually, these templates actually define all of the core structures that must exist in all standard Postgres databases. System tables, views, the PUBLIC schema, and even the character encoding, are all defined in template0, and by extension, template1. The primary difference between the two is that template0 is intended to remain pristine in case we need to create a database without any of our own standard bootstrapping code.

Heck, we can’t even connect to the thing.

psql template0
psql: FATAL:  database "template0" is not currently accepting connections

But we can make changes to template1. What happens if we create a table in the template1 database and then create a new database?

\c template1
 
CREATE TABLE some_junk (id SERIAL, trash TEXT);
CREATE DATABASE foo;
 
\c foo
\d
 
                List of relations
 Schema |       Name       |   Type   |  Owner   
--------+------------------+----------+----------
 public | some_junk        | table    | postgres
 public | some_junk_id_seq | sequence | postgres

So now we can see firsthand that objects created in template1 will automatically be created in any new database. Existing databases don’t benefit from new objects in any template database; it’s a one-time snapshot of the template state at the time the new database is created. This also applies to any object in the template. Functions, types, tables, views, or anything else in template1, will be copied to new databases upon creation.

This much is almost prevalent enough to be common knowledge. As such, DBAs and some users leverage it for popular extensions and other functionality they want included everywhere. It’s not uncommon to see something like this in a new installation:

\c template1
 
CREATE EXTENSION pg_stat_statements; -- Useful query stats.
CREATE EXTENSION pgstattuple;        -- Table data distribution info.
CREATE EXTENSION postgres_fdw;       -- Postgres foreign data wrapper.
CREATE EXTENSION plpythonu;          -- We use a lot of Python.

Now when we create a database, we’ll always have the ability to perform query forensics, analyze table bloat in the storage files, set up connections between other Postgres databases, and deploy python-based procedures. The last one probably isn’t as common as the first three, but if a series of applications make heavy use of Python and all databases in an organization reflect that, it’s nice we have the option. Any database object is created on our behalf, and we don’t even need to ask for it.

What is a bit more esoteric though, is that this also applies to data. Here’s what happens if we put a few rows in our some_junk table:

\c template1
 
INSERT INTO some_junk (trash)
SELECT repeat(a.id::TEXT, a.id)
  FROM generate_series(1,5) a(id);
 
DROP DATABASE foo;
CREATE DATABASE foo;
 
\c foo
 
SELECT * FROM some_junk;
 
 id | trash 
----+-------
  1 | 1
  2 | 22
  3 | 333
  4 | 4444
  5 | 55555

Well that changes everything, doesn’t it? Imagine this in a development or QA environment, where we may want to repeatedly build and tear down prototypes and test cases. Knowing that we can include data in template databases, means we can start with a master branch of sorts, fork a database including data fixtures, and not worry about foreign keys or other constraints wreaking havoc on the initial import process.

That’s huge, and yet it’s nearly an unknown feature in many circles. The remaining puzzle piece that really unleashes templates however, is that any database can act as a template. It’s far more obvious with template0 and template1 since they include it in their names, but we can create a database and base its contents on any other database in our instance.

Let’s remove the some_junk table from template1 and put it somewhere more appropriate. While we’re at it, let’s define a basic table relationship we can test:

\c template1
 
DROP TABLE some_junk;
 
CREATE DATABASE bootstrap;
 
\c bootstrap
 
CREATE TABLE some_junk (
  id         SERIAL PRIMARY KEY,
  trash      TEXT
);
 
CREATE TABLE some_stuff (
  id       SERIAL PRIMARY KEY,
  junk_id  INT REFERENCES some_junk (id),
  garbage  TEXT
);
 
INSERT INTO some_junk (trash)
SELECT repeat(a.id::TEXT, a.id)
  FROM generate_series(1,5) a(id);
 
INSERT INTO some_stuff (junk_id, garbage)
SELECT a.id % 5 + 1, repeat(a.id::TEXT, a.id % 5 + 1)
  FROM generate_series(1,50) a(id);
 
CREATE DATABASE test_a TEMPLATE bootstrap;

With these two tables in place along with a base set of sample data, we can run any number of verification steps before tearing it down and trying again. Let’s perform a very basic failure test by trying to insert an invalid relationship:

(cat <<EOF | psql test_a &>/dev/null
\set ON_ERROR_STOP on
BEGIN;
INSERT INTO some_stuff (junk_id, garbage) VALUES (6, 'YAY');
ROLLBACK;
EOF
); test $? -eq 3 && echo "passed"
passed

The psql command reports an exit status of 3 when a script fails in some manner. We designed the script to fail, so that’s exactly what we want. In this case, our test passed and we can continue with more tests, or drop the test_a database completely if some of the tests sufficiently tainted the data. We don’t care about any of the contents, and in fact, should throw them away as often as possible to ensure a sterile testing environment.

When it comes to development, we could use the contents of test_a to isolate some potentially dangerous set of code changes without affecting the main working data set in our development environment. Each developer can have their own playground on a shared server. We could even write a system to compare the differences between the main database and one of our forks, and produce a script to effect a migration. There are a lot of exciting use cases lurking here.

All of this does of course carry one rather glaring caveat: it’s not free. It takes time and storage resources to copy the contents of a template database. The larger the data set and count of objects, the more work Postgres must perform in order to initialize everything. If we have 100GB in the template database we’re basing further databases upon, we have to wait for that much data to be copied. Taking that into consideration, there’s probably an upper bound on how large a database can get before using it as a template becomes rather cumbersome.

On the other hand, Postgres is smart enough to realize data files themselves won’t be any different just because they reside in another database container, so it copies them wholesale. We don’t need to wait for index creation or any other high-level allocation command like we would if we were replaying a script. If we increase the row count of some_stuff to five million, filling the table takes 95 seconds on our test VM and consumes about 400MB of space. Creating a new database with it as a template however, merely requires about one second. This drastically increases iterative test throughput, provided we have such a contraption.

I almost never see this kind of usage in the wild, which is a huge shame. It’s not quite git, but we can version and branch our database contents to our heart’s content with templates. Imagine if we have a product that is distributed with a core data set. We could package a deployable binary extract by targeting the proper database for a specific application version, while still maintaining the entire product database history on our side.

Few database engines even have a mechanism for this kind of database cloning. Why let it go to waste?

Leo Hsu and Regina Obe: PLV8 for Breaking Long lines of text


Recently we found ourselves needing to purchase and download Zip+4 from the USPS. Zip+4 provides a listing of mailable addresses in the US. We intend to use it for address validation.

Each file has one single line with no linefeeds or carriage returns! Per the spec, each 182-character segment constitutes a record. The USPS was nice enough to provide a Java graphical app called CRLF that can inject breaks at specified intervals. That's all nice and well, but with hundreds of files to parse, using their interactive graphical CRLF tool is too tedious.

How could we compose a PostgreSQL function to handle the parsing? Unsure of the performance among procedural languages, we wrote the function in PL/pgSQL, SQL, and PL/V8 to compare. PL/V8 processed the files an astounding 100 times faster than the rest.

PL/V8 is nothing but PL using JavaScript. V8 is a moniker christened by Google to distinguish their JavaScript language engine from all others; it's really not all that different, if at all, from any other JavaScript. PL/V8 offers a tiny footprint compared to the stalwarts of PL/Python, PL/Perl, or PL/R. Plus, you can use PL/V8 to create windowing functions, which you can't do with PL/pgSQL and SQL. PL/V8 is sandboxed, meaning that it cannot access web services, network resources, etc., whereas PL/Python, PL/Perl, and PL/R have non-sandboxed versions. For certain applications, being sandboxed is a coup de grâce.

In our casual use of PL/V8, we found that when it comes to string, array, and mathematical operations, PL/V8 outshines PL/pgSQL, SQL, and in many cases PL/R and PL/Python.
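
To give a feel for it, here is a minimal PL/V8 sketch of the idea (the function name is made up and this is not the exact code we used), splitting one long string into fixed-width records:

CREATE OR REPLACE FUNCTION break_fixed_width(raw text, width integer DEFAULT 182)
RETURNS SETOF text AS $$
    // emit one row per fixed-width segment of the input line
    for (var i = 0; i < raw.length; i += width) {
        plv8.return_next(raw.substring(i, i + width));
    }
$$ LANGUAGE plv8 IMMUTABLE STRICT;

-- Usage: SELECT * FROM break_fixed_width(file_contents, 182);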


Continue reading "PLV8 for Breaking Long lines of text"