Channel: Planet PostgreSQL

US PostgreSQL Association: PgUS Welcomes Dallas/Fort Worth PostgreSQL User Group


As United States PostgreSQL continues its support for PostgreSQL User Groups, it is my pleasure to announce that the Dallas/Fort Worth PostgreSQL User Group has decided to become part of the PgUS family. The resolution passed with 100% consent, and it is great to see them on board. You may view the resolution here:

https://postgresql.us/node/144

read more


US PostgreSQL Association: PgUS moves to Google Apps


In an effort to utilize infrastructure far more capable than what we can provide as volunteers, PgUS has moved to Google Apps. As a 501(c)(3) we are offered a host of benefits that will help us grow the organization and increase collaboration and communication among the board and corporation members. We expect to hold technical and advocacy hangouts often, as well as to set up a PgUS YouTube channel to publish video resources for those wanting to learn more about PostgreSQL.

Look for more news in the future. These are exciting times!

Payal Singh: Size Difference Between Partitioned and Non-partitioned Tables

I inserted rows from a few child tables of a large partitioned table into a new single table, preserving the column ordering. I found that although the combined size of the child tables was larger than that of the single, non-partitioned table with the same data, the difference itself was in KB, and hence not very significant.

First, I inserted rows from 3 child tables (each ~700-800MB) into my new non-partitioned table:

$ insert into payal.hits3 select * from tracking.hits_p2014_08_25;
INSERT 0 11992623
$ insert into payal.hits3 select * from tracking.hits_p2014_08_26;
INSERT 0 13127131
$ insert into payal.hits3 select * from tracking.hits_p2014_08_27;
INSERT 0 13095656

Then, I did a vacuum full for each of the child tables, and my new non-partitioned table:

$ vacuum full verbose tracking.hits_p2014_08_26;
INFO:  vacuuming "tracking.hits_p2014_08_26"
INFO:  "hits_p2014_08_26": found 0 removable, 13127131 nonremovable row versions in 111578 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU 3.61s/8.79u sec elapsed 39.03 sec.
VACUUM
$ vacuum full verbose tracking.hits_p2014_08_27;
INFO:  vacuuming "tracking.hits_p2014_08_27"
INFO:  "hits_p2014_08_27": found 0 removable, 13095656 nonremovable row versions in 111268 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU 15.85s/8.12u sec elapsed 39.19 sec.
VACUUM
$ vacuum full verbose tracking.hits_p2014_08_25;
INFO:  vacuuming "tracking.hits_p2014_08_25"
INFO:  "hits_p2014_08_25": found 0 removable, 11992623 nonremovable row versions in 99752 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU 2.15s/7.57u sec elapsed 49.11 sec.
VACUUM
$ vacuum full verbose payal.hits3;
INFO:  vacuuming "payal.hits3"
INFO:  "hits3": found 0 removable, 38215410 nonremovable row versions in 317945 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU 11.78s/27.62u sec elapsed 137.90 sec.
VACUUM

Let's see what pg_size_pretty returns:

$ select pg_size_pretty(pg_relation_size('payal.hits3'));                                                                               
pg_size_pretty
────────────────
 2484 MB
(1 row)

$ select pg_size_pretty(pg_relation_size('tracking.hits_p2014_08_25') + pg_relation_size('tracking.hits_p2014_08_26') + pg_relation_size('tracking.hits_p2014_08_27'));
 pg_size_pretty
────────────────
 2484 MB
(1 row)

Let's see if there's a difference without pg_size_pretty:

$ select pg_relation_size('payal.hits3');                                                                                              
 pg_relation_size
──────────────────
       2604605440
(1 row)

$ select pg_relation_size('tracking.hits_p2014_08_25') + pg_relation_size('tracking.hits_p2014_08_26') + pg_relation_size('tracking.hits_p2014_08_27');
  ?column?
────────────
 2604654592
(1 row)


So, a difference of 49152 bytes for tables on the order of GB. To see whether the difference was proportional to the table size, I added one more child table:

$ insert into payal.hits3 select * from tracking.hits_p2014_08_28;
INSERT 0 12470437

$ vacuum full verbose tracking.hits_p2014_08_28;
INFO:  vacuuming "tracking.hits_p2014_08_28"
INFO:  "hits_p2014_08_28": found 0 removable, 12470437 nonremovable row versions in 106123 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU 1.96s/7.74u sec elapsed 40.25 sec.
VACUUM

$ select pg_relation_size('payal.hits3');
 pg_relation_size 
──────────────────
       3454902272
(1 row)

$ select pg_relation_size('tracking.hits_p2014_08_25') + pg_relation_size('tracking.hits_p2014_08_26') + pg_relation_size('tracking.hits_p2014_08_27') + pg_relation_size('tracking.hits_p2014_08_28');
  ?column?  
────────────
 3454984192
(1 row)

Now, the difference has increased to 81920 bytes.

Josh Berkus: Why there will be no more video for SFPUG

For the past couple of years, SFPUG has tried (stress on "tried") to share our meetups and talks with the world via streaming and archival video.  At first, this was wonderful because it allowed folks without a good local user group to tune in and participate, among other things helping launch our Beijing PUG.  This is now coming to an end because it is simply too hard to do, and nobody is stepping up to make it happen.

First, we have the issue that there simply aren't good streaming platforms for a low-budget nonprofit anymore.  JustinTV is off the air, Ustream has scads of obnoxious ads which interrupt the talk (or costs $100/month for "pro"), and Google Hangouts on Air simply don't work.  For proof of the latter, try to watch to the end of this presentation.  The alternatives require setting up your own streaming website gateway and video archives, and I simply don't have time.

And that brings us to the big reason why SFPUG video will stop: nobody has volunteered to do it.  I have my hands full scheduling and running the meetings.  I've called for video volunteers from the meetup group, but nobody has stepped forward.  So, no more video.

That out of the way, we will be reposting a bunch of archival videos from JustinTV onto YouTube.  Announcements will follow when they're all up.



Paul Ramsey: Getting distinct pixel values and pixel value counts of a raster


PostGIS raster has a great many functions, and there are often at least 10 ways of doing a given thing, some much slower than others. Suppose you have a raster, or a raster area of interest (say an elevation raster), and you want to know the distinct pixel values in the area. The temptation is to reach for the ST_Value function, but there is a much more efficient function to use, and that is ST_ValueCount.

The ST_ValueCount function is one of many statistical raster functions available in PostGIS 2.0+. It is a set-returning function that returns two values for each row: a pixel value (value), and a count of pixels in the raster that have that value. It also has variants that allow you to filter for certain pixel values.

This tip was prompted by the question on stackexchange How can I extract all distinct values from a PostGIS Raster?
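For example, a minimal query might look like this (the elev table and its rast column are assumptions, not part of the original tip):

SELECT (pvc).value, SUM((pvc).count) AS total
FROM (SELECT ST_ValueCount(rast) AS pvc FROM elev) AS t
GROUP BY (pvc).value
ORDER BY (pvc).value;

Each returned row is one distinct pixel value along with how many pixels in the whole table carry that value.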

Continue Reading by clicking title hyperlink ..

Greg Sabino Mullane: Solving pg_xlog out of disk space problem on Postgres


pg_xlog with a dummy file
(image by Andrew Malone)

Running out of disk space in the pg_xlog directory is a fairly common Postgres problem. This important directory holds the WAL (Write Ahead Log) files. (WAL files contain a record of all changes made to the database - see the link for more details). Because of the near write-only nature of this directory, it is often put on a separate disk. Fixing the out of space error is fairly easy: I will discuss a few remedies below.

When the pg_xlog directory fills up and new files cannot be written to it, Postgres will stop running, try to automatically restart, fail to do so, and give up. The pg_xlog directory is so important that Postgres cannot function until there is enough space cleared out to start writing files again. When this problem occurs, the Postgres logs will give you a pretty clear indication of the problem. They will look similar to this:


PANIC:  could not write to file "pg_xlog/xlogtemp.559": No space left on device
STATEMENT:  insert into abc(a) select 123 from generate_series(1,12345)
LOG:  server process (PID 559) was terminated by signal 6: Aborted
DETAIL:  Failed process was running: insert into abc(a) select 123 from generate_series(1,12345)
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted; last known up at 2014-09-16 10:36:47 EDT
LOG:  database system was not properly shut down; automatic recovery in progress
FATAL:  the database system is in recovery mode
LOG:  redo starts at 0/162FE44
LOG:  redo done at 0/1FFFF78
LOG:  last completed transaction was at log time 2014-09-16 10:38:50.010177-04
PANIC:  could not write to file "pg_xlog/xlogtemp.563": No space left on device
LOG:  startup process (PID 563) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure

The "PANIC" seen above is the most severe log_level Postgres has, and it basically causes a "full stop right now!". You will note in the above snippet that a normal SQL command caused the problem, which then caused all other Postgres processes to terminate. Postgres then tried to restart itself, but immediately ran into the same problem (no disk space) and thus refused to start back up. (The "FATAL" line above was another client trying to connect while all of this was going on.)

Before we can look at how to fix things, a little background will help. When Postgres is running normally, there is a finite number of WAL files (roughly twice the value of checkpoint_segments) that exist in the pg_xlog directory. Postgres deletes older WAL files, so the total number of files never climbs too high. When something prevents Postgres from removing the older files, the number of WAL files can grow quite dramatically, culminating in the out of space condition seen above. Our solution is therefore two-fold: fix whatever is preventing the old files from being deleted, and clear out enough disk space to allow Postgres to start up again.
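A quick way to see how many WAL files you currently have is to count the segment files directly (the /pgdata path matches the examples later in this post; substitute your own pg_xlog location):

# Count WAL segment files, excluding the archive_status subdirectory
ls /pgdata/pg_xlog | grep -v archive_status | wc -l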

The first step is to determine why the WAL files are not being removed. The most common case is a failing archive_command. If this is the case, you will see archive-specific errors in your Postgres log. The usual causes are a failed network, downed remote server, or incorrect copying permissions. You might see some errors like this:

2013-05-06 23:51:35 EDT [19421]: [206-1] user=,db=,remote= LOG:  archive command failed with exit code 14
2013-05-06 23:51:35 EDT [19421]: [207-1] user=,db=,remote= DETAIL:  The failed archive command was: rsync --whole-file --ignore-existing --delete-after -a pg_xlog/000000010000006B00000016 backup:/archive/000000010000006B00000016
rsync: Failed to exec ssh: Permission denied (13)
# the above was from an actual bug report; the problem was SELinux

There are some other reasons why WAL would not be removed, such as failure to complete a checkpoint, but they are very rare so we will focus on archive_command. The quickest solution is to fix the underlying problem by bringing the remote server back up, fixing the permissions, etc. (To debug, try emulating the archive_command you are using with a small text file, as the postgres user. It is generally safe to ship non-WAL files to the same remote directory). If you cannot easily or quickly get your archive_command working, change it to a dummy command that always returns true:

# On *nix boxes:
archive_command = '/bin/true'
# On BSD boxes:
archive_command = '/usr/bin/true'
# On Windows boxes:
archive_command = 'REM'

This will allow the archive_command to complete successfully, and thus lets Postgres start removing older, unused WAL files. However, you cannot start the server yet, because the lack of disk space is still a problem. Here is what the logs would look like if you tried to start it up again:

LOG:  database system shutdown was interrupted; last known up at 2014-09-16 10:38:54 EDT
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/162FE44
LOG:  redo done at 0/1FFFF78
LOG:  last completed transaction was at log time 2014-09-16 10:38:50.010177-04
PANIC:  could not write to file "pg_xlog/xlogtemp.602": No space left on device
LOG:  startup process (PID 602) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure

At this point, you must provide Postgres a little bit of room in the partition/disk that the pg_xlog directory is in. There are three approaches to doing so: removing some non-WAL files, resizing the partition, and removing some of the WAL files yourself.

The easiest solution is to clear up space by removing any non-WAL files that are on the same partition. If you do not have pg_xlog on its own partition, just remove a few files (or move them to another partition) and then start Postgres. You don't need much space - a few hundred megabytes should be more than enough.

This problem occurs often enough that I have a best practice: create a dummy file on your pg_xlog partition whose sole purpose is to get deleted after this problem occurs, and thus free up enough space to allow Postgres to start! Disk space is cheap these days, so just create a 300MB file and put it in place like so (on Linux):

dd if=/dev/zero of=/pgdata/pg_xlog/DO_NOT_MOVE_THIS_FILE bs=1MB count=300

This is a nice trick, because you don't have to worry about finding a file to remove, or determine which WALs to delete - simply move or delete the file and you are done. Once things are back to normal, don't forget to put it back in place.

Another way to get more space in your pg_xlog partition is to resize it. Obviously this is only an option if your OS/filesystem has been set up to allow resizing, but if it has, this is a quick and easy way to give Postgres enough space to start up again. No example code on this one, as the way to resize disks varies so much.

The final way is to remove some older WAL files. You need to determine which files are safest to remove. One way to determine this is to use the pg_controldata program. Just run it with the location of your data directory as the only argument, and you should be rewarded with a screenful of arcane information. The important lines will look like this:

Latest checkpoint's REDO location:    0/4000020
Latest checkpoint's REDO WAL file:    000000010000000000000005

The second line shows the last WAL file processed, and it should be safe to remove any files older than that one. (Unfortunately, older versions of PostgreSQL will not show that line, only the REDO location. While the canonical way to translate the location to a filename is with the pg_xlogfile_name() function, it is of little use in this situation, as it requires a live database! Thus, you may need another solution.)

Once you know which WAL file to keep by looking at the pg_controldata output, you can simply delete all WAL files older than that one. As with all mass deletion actions, I recommend a three-part approach. First, back everything up. This could be as simple as copying all the files in the pg_xlog directory somewhere else. Second, do a trial run. This means seeing what the deletion would do without actually deleting the files. For some commands, this means using a --dry-run or similar option, but in our example below, we can simply leave out the "-delete" argument. Third, carefully perform the actual deletion. In our example above, we could clear the old WAL files by doing:

$ cp -r /pgdata/pg_xlog/* /home/greg/backups/
$ find -L /pgdata/pg_xlog -not -newer /pgdata/pg_xlog/000000010000000000000005 | sort | less
$ find -L /pgdata/pg_xlog -not -newer /pgdata/pg_xlog/000000010000000000000005 -delete

Once you have straightened out the archive_command and cleared out some disk space, you are ready to start Postgres up. You may want to adjust your pg_hba.conf to keep everyone else out until you verify all is working. When you start Postgres, the logs will look like this:

LOG:  database system was shut down at 2014-09-16 10:28:12 EDT
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

After a few minutes, check on the pg_xlog directory, and you should see that Postgres has deleted all the extra WAL files, and the number left should be roughly twice the checkpoint_segments setting. If you adjusted pg_hba.conf, adjust it again to let clients back in. If you changed your archive_command to always return true, remember to change it back, and to generate a new base backup.

Now that the problem is fixed, how do you prevent it from happening again? First, you should use the 'tail_n_mail' program to monitor your Postgres log files, so that the moment the archive_command starts failing, you will receive an email and can deal with it right away. Making sure your pg_xlog partition has plenty of space is a good strategy as well, as the longer it takes to fill up, the more time you have to correct the problem before you run out of disk space.

Another way to stay on top of the problem is to get alerted when the pg_xlog directory starts filling up. Regardless of whether it is on its own partition or not, you should be using a standard tool like Nagios to alert you when the disk space starts to run low. You can also use the check_postgres program to alert you when the number of WAL files in the pg_xlog directory goes above a specified number.
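For example, a minimal check_postgres invocation for the WAL-count check might look like this (the thresholds are arbitrary; tune them to your checkpoint_segments setting and monitoring setup):

# Warn at 150 WAL files, go critical at 200
check_postgres.pl --action=wal_files --warning=150 --critical=200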

In summary, things you should do now to prevent, detect, and/or mitigate the problem of running out of disk space in pg_xlog:

  1. Move pg_xlog to its own partition. This not only increases performance, but keeps things simple and makes things like disk resizing easier.
  2. Create a dummy file in the pg_xlog directory as described above. This is a placeholder file that will prevent the partition from being completely filled with WAL files when 100% disk space is reached.
  3. Use tail_n_mail to instantly detect archive_command failures and deal with them before they lead to a disk space error (not to mention the stale standby server problem!)
  4. Monitor the disk space and/or number of WAL files (via check_postgres) so that you are notified that the WALs are growing out of control. Otherwise your first notification may be when the database PANICs and shuts down!

Above all, don't panic if you run out of space. Follow the steps above, and rest assured that no data corruption or data loss has occurred. It's not fun, but there are far worse Postgres problems to run into! :)

Andrew Dunstan: Importing JSON data

Say you have a file that consists of one JSON document per line. How can you import it into a table easily? This is a problem I was called on to help a colleague with yesterday. Using COPY is the obvious answer, but this turns out not to be quite so simple to do.

In text mode, COPY will be simply defeated by the presence of a backslash in the JSON. So, for example, any field that contains an embedded double quote mark, or an embedded newline, or anything else that needs escaping according to the JSON spec, will cause failure. And in text mode you have very little control over how it works - you can't, for example, specify a different ESCAPE character. So text mode simply won't work.

CSV mode is more flexible, but poses different problems. Here, instead of backslash causing a problem, QUOTE characters can cause a problem. First, JSON itself uses the default QUOTE character (double quote) to quote all string values. But if we use an alternative like a single quote, then the presence of any single quote in the JSON leads us into difficulties. Second, JSON also uses the default DELIMITER (comma) extensively. So, clearly we need to use something else for the QUOTE and DELIMITER options. (By default, in CSV mode, the ESCAPE character is the same as the QUOTE character, so we don't need to worry about it separately.)

What we in fact want is  to specify QUOTE and DELIMITER characters that can't appear at all in the JSON. Then the whole line will be seen as a single unquoted datum, which is exactly what we want. There is a small set of single-byte characters that happen to be illegal in JSON, so we can be sure that choosing them for these options should do the right thing with any legal JSON. These are the control characters. So the solution we came up with looks like this:
copy the_table(jsonfield) 
from '/path/to/jsondata'
csv quote e'\x01' delimiter e'\x02';
Of course, if the JSON has embedded newlines as punctuation, this won't work. So it's important that you configure whatever is producing the JSON not to insert newlines anywhere but at the end of each JSON document.

Now this solution is a bit of a hack. I wonder if there's a case for a COPY mode that simply treats each line as a single datum. I also wonder if we need some more specialized tools for importing JSON, possibly one or more Foreign Data Wrappers. Such things could handle, say, embedded newline punctuation.

Note too that files produced by PostgreSQL's COPY ... TO command will be properly quoted and escaped and won't need to be handled like this to read them back. Of course, if you want them to be readable by other non-CSV processors, then you might need to use similar options to those above to avoid unwanted quoting and escaping.
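Putting it together, a minimal end-to-end sketch looks like this (the table definition and the 'id' key are assumptions for illustration; the COPY options are the ones shown above):

create table the_table (jsonfield json);

copy the_table(jsonfield)
from '/path/to/jsondata'
csv quote e'\x01' delimiter e'\x02';

-- spot-check a key from the loaded documents
select jsonfield->>'id' from the_table limit 10;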

Andrew Dunstan: Big O playing catchup.

I see that a new release of MySQL has been made, and they are touting the fact that they are allowing the omission of unaggregated items in a SELECT list from a GROUP BY clause, if they are functionally dependent on the items in the GROUP BY clause. This would happen, for example, where the items in the GROUP BY list form a primary key. It's a nice feature.

It's also a feature that PostgreSQL has had for three years.
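For example, since items.id below is a primary key, PostgreSQL (9.1 and later) accepts the unaggregated items.name without it being listed in GROUP BY (the tables here are hypothetical):

select i.id, i.name, count(s.id) as num_sales
from items i
left join sales s on s.item_id = i.id
group by i.id;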

Michael Paquier: Postgres 9.5 feature highlight: Improved verbose logs in pg_dump


The following simple commit has improved the verbose logs of pg_dump (the ones that can be invoked with the -v option, and that are useful for keeping a log trace when pg_dump is kicked off from cron jobs), by making the schema names of the dumped relations show up as well:

commit: 2bde29739d1e28f58e901b7e53057b8ddc0ec286
author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
date: Tue, 26 Aug 2014 11:50:48 +0300
Show schema names in pg_dump verbose output.

Fabrízio de Royes Mello, reviewed by Michael Paquier

Let's take the case of a simple schema, with the same table name used in two different schemas:

=# CREATE SCHEMA foo1;
CREATE SCHEMA
=# CREATE SCHEMA foo2;
CREATE SCHEMA
=# CREATE TABLE foo1.dumped_table (a int);
CREATE TABLE
=# CREATE TABLE foo2.dumped_table (a int);
CREATE TABLE

With the pg_dump bundled with 9.4 and older versions, each relation cannot really be identified (think of the case of multiple versions of an application schema stored in the same database, under different schema names):

$ pg_dump -v 2>&1 >/dev/null | grep dumped_table | grep TABLE
pg_dump: creating TABLE dumped_table
pg_dump: creating TABLE dumped_table
pg_dump: setting owner and privileges for TABLE dumped_table
pg_dump: setting owner and privileges for TABLE dumped_table
pg_dump: setting owner and privileges for TABLE DATA dumped_table
pg_dump: setting owner and privileges for TABLE DATA dumped_table

Now with 9.5, the following logs are shown:

$ pg_dump -v 2>&1 >/dev/null | grep dumped_table | grep TABLE
pg_dump: creating TABLE "foo1"."dumped_table"
pg_dump: creating TABLE "foo2"."dumped_table"
pg_dump: setting owner and privileges for TABLE "foo1"."dumped_table"
pg_dump: setting owner and privileges for TABLE "foo2"."dumped_table"
pg_dump: setting owner and privileges for TABLE DATA "foo1"."dumped_table"
pg_dump: setting owner and privileges for TABLE DATA "foo2"."dumped_table"

Note as well the quotes placed around the relation and schema names, making this output more consistent with the other PostgreSQL utilities. Also, this is of course not limited to relations; it applies to any object that can be defined within a schema.

Oleg Bartunov: Jsquery - now with optimizer and hinting. EDB, more benchmarks !

EDB recently blogged about new results from benchmarking PostgreSQL 9.4 and MongoDB:

The newest round of performance comparisons of PostgreSQL and MongoDB produced a near repeat of the results from the first tests that proved PostgreSQL can outperform MongoDB. The advances Postgres has made with JSON and JSONB have transformed Postgres’ ability to support a document database.

(Chart: 50 million record benchmark results, from the EDB post)

That post motivated me to write this one, to point EDB toward another set of benchmarks that includes more of the operators provided by jsquery.

After PGCon 2014, where we presented the first version of jsquery, we made several enhancements worth mentioning (see my slides from Japan (PDF)).

1) We added a simple built-in jsquery optimizer, which recognizes the non-selective parts of a query and pushes them to recheck, so the recheck works like a FILTER.
2) If you don't like how the optimizer works, you can use hinting (well, jsquery is an extension after all).

We understand that this is just a temporary solution for impatient people who want to use jsonb in 9.4, which, honestly, has rather primitive query support. Yes, we just didn't have time to do everything we wanted; we even missed several useful functions we had written for nested hstore. We hope to have contrib/jsonbx soon. Jsquery was our experiment for playing with indexes, and the set of operations was chosen especially from that point of view. We are working on a better approach, where jsquery will be implemented at the SQL level (see this post (in Russian)), and eventually, once someone implements statistics for jsonb, the optimizer will do its work!

More details are below.

1) Optimizer. Jsquery is opaque to the planner, so the original version had a very distressing problem (we demonstrated this at PGCon 2014):
select count(*) from jr where jr @@ ' similar_product_ids && ["B000089778"] 
AND product_sales_rank( $ > 10000 AND $  < 20000)';

runs in 129.309 ms, while
 
select count(*) from jr where jr @@  ' similar_product_ids && ["B000089778"]';

takes only 0.394 ms!

product_sales_rank( $ > 10000 AND $ < 20000) is non-selective, so it is better not to use the index for this part of the query - and this is exactly what MongoDB does:

db.reviews.find(  {  $and :[ {similar_product_ids: { $in:["B000089778"]}},    {product_sales_rank:{$gt:10000, $lt:20000}}] } )
.explain()
{
	"n" : 45,
	 ….................
	"millis" : 7,
	"indexBounds" : {
		"similar_product_ids" : [                                   
			[
				"B000089778",
				"B000089778"
			]
		]
	},
}


Notice that if we rewrite our query to

select count(*) from jr where jr @@ ' similar_product_ids && ["B000089778"]'
and (jr->>'product_sales_rank')::int>10000 and (jr->>'product_sales_rank')::int<20000;
(pushing the non-selective part of the query up to the SQL level, so the optimizer can do something), the query runs in 0.505 ms, which means that Postgres is potentially (again) faster than MongoDB!

The plan of this query is:
Aggregate (actual time=0.479..0.479 rows=1 loops=1)
   ->  Bitmap Heap Scan on jr (actual time=0.079..0.472 rows=45 loops=1)
         Recheck Cond: (jr @@ '"similar_product_ids" && ["B000089778"]'::jsquery)
         Filter: ((((jr ->> 'product_sales_rank'::text))::integer > 10000) AND 
(((jr ->> 'product_sales_rank'::text))::integer < 20000))
         Rows Removed by Filter: 140
         Heap Blocks: exact=107
         ->  Bitmap Index Scan on jr_path_value_idx (actual time=0.041..0.041 rows=185 loops=1)
               Index Cond: (jr @@ '"similar_product_ids" && ["B000089778"]'::jsquery)
 Execution time: 0.506 ms


Now, jsquery is wise enough:
explain (analyze, costs off) select count(*) from jr where jr @@ 'similar_product_ids &&["B000089778"] 
AND product_sales_rank( $ > 10000 AND $  < 20000)'
------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=0.422..0.422 rows=1 loops=1)
   ->  Bitmap Heap Scan on jr (actual time=0.099..0.416 rows=45 loops=1)
         Recheck Cond: (jr @@ '("similar_product_ids" && ["B000089778"] AND "product_sales_rank"($ > 10000 AND $ < 20000))'::jsquery)
         Rows Removed by Index Recheck: 140
         Heap Blocks: exact=107
         ->  Bitmap Index Scan on jr_path_value_idx (actual time=0.060..0.060 rows=185 loops=1)
               Index Cond: (jr @@ '("similar_product_ids" && ["B000089778"] AND "product_sales_rank"($ > 10000 AND $ < 20000))'::jsquery)
 Execution time: 0.480 ms

Compare the RECHECK in this plan with the FILTER in the plan of the rewritten query above. The built-in optimizer analyzes the query tree and pushes non-selective parts to recheck. It operates with the following selectivity classes:

1) Equality (x = c)
2) Range (c1 < x < c2)
3) Inequality (c > c1)
4) Is (x is type)
5) Any (x = *)

Jsquery provides two debug functions, gin_debug_query_path_value and gin_debug_query_value_path, one for each of the opclasses, to check how jsquery will process a query:
SELECT gin_debug_query_path_value('similar_product_ids && ["B000089778"] 
                       AND product_sales_rank( $ > 10000 AND $  < 20000)');
           gin_debug_query_path_value
-------------------------------------------------
 similar_product_ids.# = "B000089778" , entry 0 +

Only the first part of the query will be processed by an index created using the jsonb_path_value opclass.

2) Hinting. Use /*-- noindex */ or /*-- index */ before an operator to suppress or force the use of an index. To illustrate hinting, I'll use the debug function gin_debug_query_path_value:

SELECT gin_debug_query_path_value('product_sales_rank > 10000');
      gin_debug_query_path_value
---------------------------------------
 product_sales_rank > 10000 , entry 0 +

SELECT gin_debug_query_path_value('product_sales_rank /*-- noindex */ > 10000');
 gin_debug_query_path_value
----------------------------
 NULL                      +
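The same hint string works inline in an actual query; for example, a sketch mirroring the debug call above against the jr table used earlier:

select count(*) from jr where jr @@ 'product_sales_rank /*-- noindex */ > 10000';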



Jsquery is available from its Git repository and is compatible with 9.4. To install it, just follow the regular procedure for installing extensions. Usage examples are available in the sql/ subdirectory.

Josh Berkus: Why you need to avoid Linux Kernel 3.2

In fact, you really need to avoid every kernel between 3.0 and 3.8.  While RHEL has been sticking to the 2.6 kernels (which have their own issues, but not as bad as this), Ubuntu has released various 3.X kernels for 12.04.  Why is this an issue?  Well, let me give you two pictures.

Here's a private benchmark workload running against PostgreSQL 9.3 on Ubuntu 12.04 with kernel 3.2.0.  This is the IO utilization graph for the replica database, running a read-only workload:



Sorry for cropping; had to eliminate some proprietary information.  The X-axis is time. This graph shows MB/s data transfers -- in this case, reads -- for the main database disk array.  As you can see, it goes from 150MB/s to over 300MB/s.  If this wasn't an SSD array, this machine would have fallen over. 

Then we upgraded it to kernel 3.13, and ran the same exact workload as the previous test.  Here's the new graph:



Bit of a difference, eh?  Now we're between 40 and 60MB/s for the exact same workload: an 80% reduction in IO.   We can thank the smart folks in the Linux FS/MM group for hammering down a whole slew of performance issues.

So, check your Postgres servers and make sure you're not running a bad kernel!

gabrielle roth: My PgConf.EU Schedule

Yep, I’m headed to Madrid! I’ll be reprising my Autovacuum talk from SCALE, and am really looking forward to meeting some new folks. I’ll be helping out at the conference in some capacity, so come say hello. For reference, the conference schedule is here: http://www.postgresql.eu/events/schedule/pgconfeu2014/ Other talks I plan to attend: Wednesday: Performance Archaeology sounds […]

Tomas Vondra: Examples of palloc overhead


This is the post I promised last week, explaining a few common issues with memory contexts. The issues mostly lead to excessive overhead, i.e. excessive usage of resources - usually memory. And that's exactly what this post is about, so whenever I say "overhead" you can read "excessive memory usage." I will also try to give advice on how to avoid those issues or minimize their impact.

As I briefly mentioned when explaining allocation set internals, there are three main sources of overhead:

  • chunk headers - for large chunks, this gets negligible
  • unused space (because of 2^N chunks) - expected ~25% for randomly sized chunks, but can get much worse
  • reuse not working efficiently - we'll see some examples how this can happen

If you haven't read that post, it's probably the right time to do that. Also, if you want to learn more about memory management and allocator implementations, there's a great post at IBM developerWorks explaining it quite well and also listing many interesting additional resources (e.g. various malloc implementations).

So let's see some common (but somewhat unexpected) examples of palloc overhead.

Allocating many tiny pieces of memory

Say we need to store a lot of small elements, a few bytes each - e.g. 64-bit integers, or words in an ispell dictionary. You could do a separate palloc call for each element (and keep the pointer), but in that case you'll pay the 'header' price for each element. For example, by doing this

int64  *items[1024];

for (i = 0; i < 1024; i++)
    items[i] = (int64 *) palloc(sizeof(int64));

you might think you allocated ~8kB of memory (1024 x 8B), but in fact this prepends each value with a 16B header. So you end up with about 24kB (not counting the items array, which needs an additional 8kB), which is ~3x the requested amount. If we asked for smaller values (e.g. 4B integers), the overhead would be even larger, because 8B is the smallest chunk allocated by palloc (as mentioned in the previous post).

Let's assume all the elements have about the same life span / scope (i.e. can be freed at the same time), and don't need to be passed somewhere else (i.e. are used only locally). In such cases the best approach is often preallocating a large chunk of memory, and then slicing it into small pieces - essentially doing what the allocation set allocator does when slicing the block into chunks, but without the chunk headers.

Of course, that means you won't be able to call pfree on the elements, but that's OK because we assumed only local usage (if we passed one of the elements somewhere else, that code would be unaware of this). And if the elements have the same life span (as, for example, words in an ispell dictionary), we don't really need to call pfree on the individual words, as the dictionary gets freed all at once.

This is especially trivial to do when the elements are of fixed size, and you know how many of them to expect - in that case you can just do

char *tmp = palloc(elementSize * elementCount);

tmp                       -> element #0
(tmp + elementSize)       -> element #1
...
(tmp + k * elementSize)   -> element #k

and you're done. But what to do when the elements are variable-sized, or when you don't know how many of them to expect? Well, who says you have to allocate all the elements at once? You can allocate a large chunk of memory, use it for elements until you run out of space, then allocate another one and start over.

A very simple example of this approach is the compact_palloc0 method in spell.c, which is used for efficient allocation of words and other dictionary-related data for full-text search (the following code is somewhat simplified, to make it easier to understand):

#define COMPACT_ALLOC_CHUNK 8192

/* current chunk and available space */
char   *current_ptr = NULL;
Size    available_space = 0;

static void *
compact_palloc0(IspellDict *Conf, size_t size)
{
    void   *result;

    /* Need more space? */
    if (size > available_space)
    {
        current_ptr = palloc0(COMPACT_ALLOC_CHUNK);
        available_space = COMPACT_ALLOC_CHUNK;
    }

    result = (void *) current_ptr;

    current_ptr += size;
    available_space -= size;

    return result;
}

This effectively adds another (very simple) allocator on top of the AllocSet. The "large" chunks (allocated using palloc) are registered in the parent memory context and will be freed when that memory context gets destroyed.

Another example of such "dense" allocation was recently committed into our hashjoin implementation - it's however slightly more complicated (to support some hashjoin-specific requirements). See commit 45f6240a for more details.

Allocating pieces that are not 2^N bytes

The other trap you may fall into is the 2^N sizing of chunks. There are plenty of ways to "achieve" that - I'll show you a real-world example from my quantile extension. I already fixed it, so you have to look at commit c1e7bba9 to see it.

Let's say you're writing an aggregate function MEDIAN(float) - to achieve that, you need to collect the float values into the aggregate state (so that you can later sort them and choose the "middle" value, which is the median). The aggregate state might be represented by this structure

typedef struct median_state {
    int     nelements;
    int     next;
    float  *elements;
} median_state;

The float elements are accumulated into the 'elements' array, which is resized on the fly as needed. The size of the array (including the unused part) is tracked by 'nelements', and 'next' is the index of the next available element. So when next == nelements happens, we need to enlarge the array, which is done like this:

#define SLICE_SIZE 5

...

if (state->nelements == 0)
{
    state->nelements = SLICE_SIZE;
    state->elements = (float *) palloc(sizeof(float) * state->nelements);
}
else if (state->next > state->nelements - 1)
{
    state->nelements = state->nelements + SLICE_SIZE;
    state->elements = (float *) repalloc(state->elements,
                                         sizeof(float) * state->nelements);
}

Can you spot the problem? It's pretty obvious :-/

Well, every time we call repalloc, the 'elements' array grows by 20 bytes (SLICE_SIZE * 4B). That's rather pointless, because it either fits into the current chunk (and then it's almost free), or the array has to be moved to a larger chunk. But it does not save any memory at all.

Moreover there's a small inefficiency, because 20 is not a divisor of chunk sizes (following the 2^N rule). This wastes a bit of memory, but for larger chunks this is negligible.

It does, however, clearly show that constant growth does not work, because it does not follow the 2^N chunk size pattern. A saner version of the resize looks about like this:

#define SLICE_SIZE 8

...

if (state->nelements == 0)
{
    state->nelements = SLICE_SIZE;
    state->elements = (float *) palloc(sizeof(float) * state->nelements);
}
else if (state->next > state->nelements - 1)
{
    state->nelements = state->nelements * 2;
    state->elements = (float *) repalloc(state->elements,
                                         sizeof(float) * state->nelements);
}

So, this is better - it starts with 8 elements (because 8 * 4B = 32B, which is 2^5 and thus follows the 2^N rule). I could have used a smaller value (as low as 2, because the smallest chunk is 8B). This would have about the same memory consumption as before, but it's more correct, and repalloc will be called much less frequently (exponentially less).

Now, let's see a more serious example, one that I almost committed into count_distinct but luckily caught before doing so:

Let's say the state for the MEDIAN() aggregate was defined like this:

typedef struct median_state {
    int     nelements;
    int     next;
    float   elements[1];
} median_state;

Placing a single-element array at the end of a struct is a well-known trick for defining variable-length structures. The resize might then look like this:

#define SLICE_SIZE 8

...

if (state->nelements == 0)
{
    state->nelements = SLICE_SIZE;
    state = (median_state *) palloc(offsetof(median_state, elements) +
                                    sizeof(float) * state->nelements);
}
else if (state->next > state->nelements - 1)
{
    state->nelements = state->nelements * 2;
    state = (median_state *) repalloc(state,
                                      offsetof(median_state, elements) +
                                      sizeof(float) * state->nelements);
}

We're still keeping the array nicely sized (still 2^N bytes), so perfect, right? Well, no. What gets allocated is the whole structure, and sadly that's always (2^N + 8B), because the two integers are part of the chunk. So we'll always request the perfect size + 8B, which pretty much means we'll get 2x the necessary chunk size (because of the overflowing 8B). And this time we're guaranteed to waste the upper half of the chunk (except the first 8B). That kinda sucks.

There are two ways to fix this - either by allocating the array separately (which is what I did in count_distinct) or keeping the total size 2^N (and tweaking the nelements appropriately).

Creating many small contexts

Sometimes "less is more" and it certainly holds for memory contexts. A nice example is array_agg - an aggregate function that accumulates all the values into an array. The heavylifting is done by accumArrayResult in arrayfuncs.c. This piece of code performs initialization when the first value is passed to the function:

if (astate == NULL)
{
    /* First time through --- initialize */

    /* Make a temporary context to hold all the junk */
    arr_context = AllocSetContextCreate(rcontext,
                                        "accumArrayResult",
                                        ALLOCSET_DEFAULT_MINSIZE,
                                        ALLOCSET_DEFAULT_INITSIZE,
                                        ALLOCSET_DEFAULT_MAXSIZE);
    oldcontext = MemoryContextSwitchTo(arr_context);
    astate = (ArrayBuildState *) palloc(sizeof(ArrayBuildState));
    astate->mcontext = arr_context;
    astate->alen = 64;          /* arbitrary starting array size */
    astate->dvalues = (Datum *) palloc(astate->alen * sizeof(Datum));
    astate->dnulls = (bool *) palloc(astate->alen * sizeof(bool));
    astate->nelems = 0;
    astate->element_type = element_type;
    get_typlenbyvalalign(element_type,
                         &astate->typlen,
                         &astate->typbyval,
                         &astate->typalign);
}

It's slightly complicated because it sets a lot of values in the aggregate state, but apparently it preallocates space for 64 elements (astate->alen), which is 512B on 64-bit architectures. It then properly doubles the size (not shown here for brevity). Perfect, right? What could go wrong?

Well, the first thing the code actually does is create a dedicated memory context (for this group), and it uses ALLOCSET_DEFAULT_INITSIZE as the initial block size. And ALLOCSET_DEFAULT_INITSIZE is 8kB, so on the first palloc this memory context allocates an 8kB block. The fact that we wanted to preallocate space for only 64 elements is irrelevant - we'll get 16x that.

Now, imagine aggregation with many distinct groups. Each group will get 8kB, even though there may be a single item in the array. And we actually get bug reports related to this.

What makes this even worse is that keeping per-group memory contexts makes it impossible to reuse chunks across groups.

Gradually growing request sizes

The last issue is quite different from the previous ones, because it's about (in)efficient chunk reuse.

As I mentioned, chunks are not really freed - instead they're moved to a freelist for later reuse. So when you do palloc(50) you'll get a 64B chunk, and when you do pfree(chunk) it'll be moved to a freelist and eventually used for other requests needing 64B chunks.

But what happens if the requested sizes only grow? Consider, for example, this code:

int     i = 0, j = 0;
size_t  current_size = 8;
int     nchunks = 100;
int     nloops = 10;

char   *tmp[nchunks];

for (i = 0; i < nchunks; i++)
    tmp[i] = palloc(current_size);

for (i = 0; i < nloops; i++)
{
    current_size *= 2;
    for (j = 0; j < nchunks; j++)
        tmp[j] = repalloc(tmp[j], current_size);
}

So, how much memory is allocated at the end? There are 100 chunks, and the final chunk size is 8kB. So it has to be 800kB, right?

Actually, it's about double that, because none of the smaller chunks will ever be reused. The fact that the request sizes only grow prevents chunk reuse - the chunks will get stuck in the freelists until the memory context is destroyed.

It is, however, true that this overhead is bounded (it can't be higher than 100%, because it's bounded by the sum of the infinite series 1/2^k), and most workloads actually mix requests of various sizes (making reuse possible).

Summary

  • Don't try to be overly smart - follow the 2^N rule by allocating properly sized pieces and doubling the size when needed.
  • Where applicable, consider using dense allocation (as for example compact_palloc in spell.c).

Payal Singh: Changing Owner of Multiple Database Objects

A while ago I got a task to change the owner of a group of functions. While the number of functions wasn't too high, it was still enough that I began looking at ways to change the owner in a batch, instead of having to manually change it for each function.
In the case of other database objects, changing owners is fairly simple. It can be accomplished in two steps:

1. Get list of all tables/sequences/views:

payal@testvagrant:~$ psql -qAt -c "SELECT 'ALTER TABLE '||schemaname||'.'||tablename||' OWNER TO new_owner;' FROM pg_tables WHERE schemaname = 'payal'" > test.txt

This will give us the following file:

payal@testvagrant:~$ cat test.txt
ALTER TABLE payal.new_audit_users OWNER TO new_owner;
ALTER TABLE payal.v_count_states OWNER TO new_owner;
ALTER TABLE payal.test OWNER TO new_owner;
ALTER TABLE payal.old_audit_users OWNER TO new_owner;
ALTER TABLE payal.old_audit OWNER TO new_owner;
ALTER TABLE payal.adwords_dump OWNER TO new_owner;
ALTER TABLE payal.affiliate OWNER TO new_owner;
ALTER TABLE payal.new_affiliate OWNER TO new_owner;
ALTER TABLE payal.partest OWNER TO new_owner;
ALTER TABLE payal.audit_test OWNER TO new_owner;
ALTER TABLE payal.batatawada OWNER TO new_owner;
ALTER TABLE payal.dup_key_err OWNER TO new_owner;
ALTER TABLE payal.new_audit OWNER TO new_owner;


2. Now all that is needed is to run this file with psql:

payal@testvagrant:~$ psql < test.txt
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE



That simple! An alternate solution can be found here.

However, things get a little tricky with functions due to argument specifications. Basically, one needs to specify a function's arguments along with the function name to alter it. For example:

ALTER FUNCTION hstore.tconvert(text, text) OWNER TO hstore;

With the method described above for tables, you cannot easily get the function arguments from pg_proc. Instead, Postgres has a function, pg_get_function_identity_arguments(func_oid), that RhodiumToad told me about in the #postgresql IRC channel. This function returns all the arguments of a function. So, we can run a query like:

payal@testvagrant:~$ psql -qAXt -c "select 'ALTER FUNCTION ' || n.nspname || '.' || p.proname || '(' || pg_catalog.pg_get_function_identity_arguments(p.oid) || ') OWNER TO hstore;' from pg_proc p, pg_namespace n where p.pronamespace = n.oid and n.nspname = 'hstore'" -o alterfunctions.sql postgres

This gets us a list of all functions in the hstore schema, with their arguments:

payal@testvagrant:~$ tail -10 alterfunctions.sql
ALTER FUNCTION hstore.ghstore_compress(internal) OWNER TO hstore;
ALTER FUNCTION hstore.ghstore_decompress(internal) OWNER TO hstore;
ALTER FUNCTION hstore.ghstore_penalty(internal, internal, internal) OWNER TO hstore;
ALTER FUNCTION hstore.ghstore_picksplit(internal, internal) OWNER TO hstore;
ALTER FUNCTION hstore.ghstore_union(internal, internal) OWNER TO hstore;
ALTER FUNCTION hstore.ghstore_same(internal, internal, internal) OWNER TO hstore;
ALTER FUNCTION hstore.ghstore_consistent(internal, internal, integer, oid, internal) OWNER TO hstore;
ALTER FUNCTION hstore.gin_extract_hstore(internal, internal) OWNER TO hstore;
ALTER FUNCTION hstore.gin_extract_hstore_query(internal, internal, smallint, internal, internal) OWNER TO hstore;
ALTER FUNCTION hstore.gin_consistent_hstore(internal, smallint, internal, integer, internal, internal) OWNER TO hstore;


Now we can just run this file with psql:

payal@testvagrant:~$ psql < alterfunctions.sql
ALTER FUNCTION
ALTER FUNCTION
ALTER FUNCTION
ALTER FUNCTION
ALTER FUNCTION
ALTER FUNCTION
ALTER FUNCTION
ALTER FUNCTION
ALTER FUNCTION
ALTER FUNCTION
ALTER FUNCTION
ALTER FUNCTION




Craig Kerstiens: A simple guide for DB migrations


Most web applications will add/remove columns over time. This is extremely common early on, and even mature applications will continue modifying their schemas with new columns. An all too common pitfall when adding new columns is setting a not null constraint in Postgres.

Not null constraints

What happens when you add a column with a not null constraint is that Postgres will re-write the entire table. Under the covers, Postgres is really just an append-only log, so when you update or delete data it's really just writing new data. This means that when you add a column with a new value, it has to write a new record for every row. If you do this while requiring the column to not be null, then you're re-writing your entire table.

Where this becomes problematic for larger applications is that it holds a lock, preventing you from writing new data during this time.

A better way

Of course you may want to disallow nulls and you may want to set a default value; the problem simply comes when you try to do this all at once. The safest approach, at least in terms of uptime for your table -> data -> application, is to break these steps apart:

  1. Start by simply adding the column, allowing nulls but setting a default value
  2. Run a background job that will go and retroactively update the new column to your default value
  3. Add your not null constraint.

Yes, it's a few extra steps, but I can say from having walked through this with a number of developers and their apps that it makes for a much smoother process when making changes to your apps.
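As a concrete sketch of those three steps (the users table and is_active column are hypothetical):

-- 1. add the column allowing nulls, with a default for new rows only
alter table users add column is_active boolean;
alter table users alter column is_active set default true;

-- 2. backfill existing rows from a background job, ideally in batches
update users set is_active = true where is_active is null;

-- 3. once everything is backfilled, add the constraint
alter table users alter column is_active set not null;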


Hubert 'depesz' Lubaczewski: Waiting for 9.5 – Row-Level Security Policies (RLS)

On 19th of September, Stephen Frost committed patch: Row-Level Security Policies (RLS)   Building on the updatable security-barrier views work, add the ability to define policies on tables to limit the set of rows which are returned from a query and which are allowed to be added to a table. Expressions defined by the policy […]

Josh Berkus: JSONB and 9.4: Move Slow and Break Things

If you've been paying any attention at all, you're probably wondering why 9.4 isn't out yet.  The answer is that we had to change JSONB at the last minute, in a way that breaks compatibility with earlier betas.

In August, a beta-testing user reported that we had an issue with JSONB not compressing well.  This was because of the binary structure of key offsets at the beginning of the JSONB value, and the effects were dramatic; in the worst cases, JSONB values were 150% larger than comparable JSON values.  We spent August through September revising the data structure, and Heikki and Tom eventually developed one which gives better compressibility without sacrificing extraction speed.

I did a few benchmarks on the various JSONB types.  We're getting a JSONB which is both faster and smaller than competing databases, so it'll be worth the wait.

However, this means that we'll be releasing a 9.4beta3 next week, whose JSONB type will be incompatible with prior betas; you'll have to dump and reload if you were using Beta 1 or Beta 2 and have JSONB data.  It also means a delay in the final release of 9.4.

gabrielle roth: PDXPUG: October meeting


When: 6-8pm Thu Oct 16, 2014
Where: Iovation
What: PgOpen Recap (gabrielle, Mark, John M);  New Relic Instrumentation of Pg Queries (Andrew)

Please note the new earlier meeting time! We’ll try this over the winter.

Two topics this month:  PgOpen attendees will discuss highlights of that conference, and Andrew will talk about some New Relic-y stuff.

Our meeting will be held at Iovation, on the 32nd floor of the US Bancorp Tower at 111 SW 5th (5th & Oak).  It’s right on the Green & Yellow Max lines.  Underground bike parking is available in the parking garage;  outdoors all around the block in the usual spots.  No bikes in the office, sorry!

Elevators open at 5:45 and building security closes access to the floor at 6:30.


Michael Paquier: Postgres 9.5 feature highlight: Row-Level Security and Policies


Row-level security is a new feature of PostgreSQL 9.5 that has been introduced by this commit:

commit: 491c029dbc4206779cf659aa0ff986af7831d2ff
author: Stephen Frost <sfrost@snowman.net>
date: Fri, 19 Sep 2014 11:18:35 -0400
Row-Level Security Policies (RLS)

Building on the updatable security-barrier views work, add the
ability to define policies on tables to limit the set of rows
which are returned from a query and which are allowed to be added
to a table.  Expressions defined by the policy for filtering are
added to the security barrier quals of the query, while expressions
defined to check records being added to a table are added to the
with-check options of the query.

Behind this jargon is a feature that could be described briefly as a permission manager complementary to GRANT and REVOKE, one that allows controlling at row level which tuples can be retrieved by a read query or manipulated using INSERT, UPDATE or DELETE. This row-level control mechanism is managed using a new command called CREATE POLICY (its companions ALTER POLICY, to update an existing policy, and DROP POLICY, to remove one, exist as well). By default, tables have no restrictions in terms of how rows can be added and manipulated; however, they can be made to accept row-level restriction policies using ALTER TABLE ... ENABLE ROW LEVEL SECURITY. Now, let's imagine the following table holding a list of employees and their respective salaries (salary is an integer, as this is an entirely fictive situation and refers to no real one, quoique...):

=# CREATE TABLE employee_data (id int,
       employee text,
       salary int,
       phone_number text);
CREATE TABLE
=# CREATE ROLE ceo;
CREATE ROLE
=# CREATE ROLE jeanne;
CREATE ROLE
=# CREATE ROLE bob;
CREATE ROLE
=# INSERT INTO employee_data VALUES (1, 'ceo', 300000, '080-7777-8888');
INSERT 0 1
=# INSERT INTO employee_data VALUES (2, 'jeanne', 1000, '090-1111-2222');
INSERT 0 1
=# INSERT INTO employee_data VALUES (3, 'bob', 30000, '090-2222-3333');
INSERT 0 1

Now let's set some global permissions on this relation using GRANT. Logically, the CEO has complete control (?!) over the salary grid of his employees.

=# GRANT SELECT, INSERT, UPDATE, DELETE ON employee_data TO ceo;
GRANT

A normal employee has access to all the information, and can also update his/her phone number or even his/her name:

=# GRANT SELECT (id, employee, phone_number, salary)
   ON employee_data TO public;
GRANT
=# GRANT UPDATE (employee, phone_number) ON employee_data TO public;
GRANT

As things stand now, though, everybody is able to manipulate others' private data, not only their own. For example, Jeanne can update her CEO's name:

=# SET ROLE jeanne;
SET
=> UPDATE employee_data
   SET employee = 'Raise our salaries -- Signed: Jeanne'
   WHERE employee = 'ceo';
UPDATE 1

Row-level security can be used to control with more granularity which rows can be manipulated, under a set of circumstances defined by a policy. First, RLS must be enabled on the given table:

=# ALTER TABLE employee_data ENABLE ROW LEVEL SECURITY;
ALTER TABLE

Note that if RLS is enabled and no policies are defined, normal users can do nothing, even if they have been granted access to a certain set of operations:

=> set role ceo;
SET
=> UPDATE employee_data SET employee = 'I am God' WHERE id = 1;
UPDATE 0

So it is absolutely mandatory to set policies to get the level of control wanted for a relation once RLS is in the game. First, a policy needs to be defined to let the CEO have complete access to the table (FOR ALL being the default, all operations are authorized this way for the CEO), and luckily Jeanne just got a promotion:

=# CREATE POLICY ceo_policy ON employee_data TO ceo
   USING (true) WITH CHECK (true);
CREATE POLICY
=# SET ROLE ceo;
SET
=> UPDATE employee_data SET salary = 5000 WHERE employee = 'jeanne' ;
UPDATE 1
=> SELECT * FROM employee_data ORDER BY id;
 id | employee | salary | phone_number
----+----------+--------+---------------
  1 | ceo      | 300000 | 080-7777-8888
  2 | jeanne   |   5000 | 090-1111-2222
  3 | bob      |  30000 | 090-2222-3333
(3 rows)

Even with SELECT access allowed through GRANT, Bob and Jeanne cannot view any rows, so they cannot see even their own information. This can be solved with a new policy (note in this case the USING clause, which can be used to define a boolean expression on which the rows are filtered):

=# CREATE POLICY read_own_data ON employee_data
   FOR SELECT USING (current_user = employee);
CREATE POLICY
=# SET ROLE jeanne;
SET
=> SELECT * FROM employee_data;
 id | employee | salary | phone_number
----+----------+--------+---------------
  2 | jeanne   |   5000 | 090-1111-2222
(1 row)

A user should be able to modify his own information as well; note now the WITH CHECK clause, which can be used to check the validity of a row once it has been manipulated. In this case, the employee name cannot be updated to a value other than the role name (it would have been better not to give UPDATE access to this column with GRANT, but that would have made an example above invalid...), and the new phone number cannot be NULL (note that the CEO can actually still set his phone number to NULL, something that would be less flexible with a CHECK constraint at relation level):

=# CREATE POLICY modify_own_data ON employee_data
   FOR UPDATE USING (current_user = employee)
   WITH CHECK (employee = current_user AND phone_number IS NOT NULL);
CREATE POLICY
=# SET ROLE jeanne;
SET
=> UPDATE employee_data SET id = 10; -- blocked by GRANT
ERROR:  42501: permission denied for relation employee_data
LOCATION:  aclcheck_error, aclchk.c:3371
=> UPDATE employee_data SET phone_number = NULL; -- blocked by policy 
ERROR:  44000: new row violates WITH CHECK OPTION for "employee_data"
DETAIL:  Failing row contains (2, jeanne, 5000, null).
LOCATION:  ExecWithCheckOptions, execMain.c:1684
=> UPDATE employee_data SET phone_number = '1-1000-2000'; -- OK
UPDATE 1

Using this new policy, Jeanne has updated her phone number, and the CEO can check that freely:

=> SET ROLE ceo;
SET
=> SELECT * FROM employee_data ORDER BY id;
 id | employee | salary | phone_number
----+----------+--------+---------------
  1 | ceo      | 300000 | 080-7777-8888
  2 | jeanne   |   5000 | 1-1000-2000
  3 | bob      |  30000 | 090-2222-3333
(3 rows)

So, while GRANT and REVOKE offer vertical control (control of columns) over the actions that can be done on a relation for a set of users, RLS offers the possibility to control things horizontally for each record. When using this feature, be sure to use both together, and wisely.

Andrew Dunstan: Towards a PostgreSQL Benchfarm

For years I have been wanting to set up a farm of machines, modelled after the buildfarm, that will run some benchmarks and let us see performance regressions. Today I'm publishing some progress on that front, namely a recipe for vagrant to set up an instance on AWS of the client I have been testing with. All this can be seen on the PostgreSQL Buildfarm Github Repository on a repo called aws-vagrant-benchfarm-client. The README explains how to set it up. The only requirement is that you have vagrant installed and the vagrant-aws provider set up (and, of course, an Amazon AWS account to use).
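Assuming you have a checkout of the aws-vagrant-benchfarm-client repo and your AWS credentials configured per its README, bringing up the client is roughly:

# one-time: install the AWS provider plugin
vagrant plugin install vagrant-aws

# provision and start the benchfarm client instance on AWS
vagrant up --provider=aws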

Of course, we don't want to run members of the benchfarm on smallish AWS instances. But this gives me (and you, if you want to play along) something to work on, and the provisioning script documents all the setup steps rather than relying on complex instructions.

The provisioner installs a bleeding edge version of the buildfarm client's experimental Pgbench module, which currently only exists on the "benchfarm" topic branch. This module essentially runs Greg Smith's pgbench-tools suite, gets the results from the results database's "tests" table, and bundles them as a CSV for upload to the server.

Currently the server does nothing with it. This will just look like another buildfarm step. So the next thing to do is to get the server to start producing some pretty and useful graphs. Also, we need to decide what else we might want to capture.