Channel: Planet PostgreSQL

Feng Tian: TPCH on PostgreSQL (Part 2)

CTE is a great feature.

We also need to rewrite Q20 extensively.  We haven't really tried our best; we believe we could unroll the three subqueries into outer joins, but it is mind-boggling.  The WITH clause (Common Table Expression) comes to the rescue.  CTEs allow one to write simple, clear, easy-to-understand steps in SQL.  This is very much like writing a query in Pig [1] (or Quel, for old-timers), except with a better syntax (and it is standard).

What one needs to be aware of is that the WITH clause is an optimization barrier in PostgreSQL: the planner optimizes each CTE separately and will not consider global optimizations.  This is both good and bad.  One can use this behavior to reduce the planner's search space and "force" a plan (for example, a join ordering).  On the other hand, more often than not, the optimizer (or planner) is smarter than the programmer.
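As a side note for readers on newer versions (this is an addition to the original text, and the syntax below does not exist in the PostgreSQL versions discussed here): since PostgreSQL 12 the barrier is controllable per CTE.

```sql
-- PostgreSQL 12+ only: NOT MATERIALIZED lets the planner inline the CTE,
-- so an index on the outer predicate can be used;
-- MATERIALIZED keeps the old optimization-barrier behavior.
EXPLAIN
WITH x AS NOT MATERIALIZED (SELECT * FROM t WHERE j = 100)
SELECT * FROM x WHERE i = 10;
```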

Let's look at the following simple example,

create table t as select i, i + 1 as j from generate_series(1, 10000000) i;
create index ti on t(i);

ftian=# explain select * from t where i = 10 and j = 100;
                              QUERY PLAN                               
-----------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=927.50..48029.26 rows=250 width=8)
   Recheck Cond: (i = 10)
   Filter: (j = 100)
   ->  Bitmap Index Scan on ti  (cost=0.00..927.43 rows=50000 width=0)
         Index Cond: (i = 10)

(5 rows)

ftian=# explain with x as (select * from t where j = 100) select * from x where i = 10;
                          QUERY PLAN                          
--------------------------------------------------------------
 CTE Scan on x  (cost=169247.81..169247.83 rows=1 width=8)
   Filter: (i = 10)
   CTE x
     ->  Seq Scan on t  (cost=0.00..169247.81 rows=1 width=8)
           Filter: (j = 100)
(5 rows)

Note that in the second query, the planner is not able to utilize the index, because the CTE is planned in isolation.

Closely related to the WITH clause are views.  One can create views one by one to achieve what WITH clauses do, but this is much more intrusive in the sense that it modifies the catalog.  On the other hand, a view is not an optimization barrier.  For example,

create view vt as select * from t where j = 100;
ftian=# explain select * from vt where i = 10;
                         QUERY PLAN                         
------------------------------------------------------------
 Index Scan using ti on t  (cost=0.43..8.46 rows=1 width=8)
   Index Cond: (i = 10)
   Filter: (j = 100)
(3 rows)

The index ti is used, no problem.  But can someone explain why this produces a different plan?  It uses an Index Scan instead of a Bitmap Index Scan.  Well, the optimizer is really fun.



[1] http://pig.apache.org

Jehan-Guillaume (ioguix) de Rorthais: Btree bloat query - part 4


Thanks to the various PostgreSQL environments we have under monitoring at Dalibo, these Btree bloat estimation queries keep challenging me occasionally because of statistics deviation… or bugs.

For people who visit this blog for the first time, don’t miss the three previous parts, stuffed with some interesting infos about these queries and BTree indexes: part 1, part 2 and part 3.

For people in a hurry, here are the links to the queries:

Columns have been ignored

In two different situations, some index fields were just ignored by the query:

  • after renaming the field in the table
  • if the index field was an expression

I cheated a bit for the first fix, looking at psql’s answer to this question (thank you -E).

The second one was an easy fix, but sadly only for versions 8.0 and above. It seems to me there’s no solution for 7.4.

These bugs have the same result: a very bad estimation. An index field is ignored in both cases, so the bloat looks much bigger with the old version of the query. Here is a demo with an index on an expression:

postgres@pagila=# create index test_expression on test (rental_id, md5(rental_id::text));
CREATE INDEX
postgres@pagila=# analyze;
ANALYZE
postgres@pagila=# \i old/btree_bloat.sql-20141022
 current_database | schemaname | tblname |     idxname     | real_size | estimated_size | bloat_size |        bloat_ratio         | is_na 
------------------+------------+---------+-----------------+-----------+----------------+------------+----------------------------+-------
 pagila           | public     | test    | test_expression |    974848 |         335872 |     638976 |        65.5462184873949580 | f

Most of this 65% bloat estimation is actually the data of the missing field. The result is much more coherent with the latest version of the query: a freshly created index is supposed to have around 10% of bloat, as shown in the second query below:

postgres@pagila=# \i sql/btree_bloat.sql
 current_database | schemaname | tblname |     idxname     | real_size | estimated_size | bloat_size |   bloat_ratio    | is_na 
------------------+------------+---------+-----------------+-----------+----------------+------------+------------------+-------
 pagila           | public     | test    | test_expression |    974848 |         851968 |     122880 | 12.6050420168067 | f

postgres@pagila=# SELECT relname,
  100 - (stattuple.pgstatindex(relname)).avg_leaf_density AS bloat_ratio
  FROM pg_class WHERE relname = 'test_expression';
     relname     | bloat_ratio 
-----------------+-------------
 test_expression |       10.33

Wrong estimation for varlena types

After fixing the query for indexes on expressions, I noticed some negative bloat estimations for the biggest indexes: the real index was smaller than the estimated one!

postgres@pagila=# create table test3 as select i from generate_series(1, 10000000) i;
SELECT 10000000
postgres@pagila=# create index on test3 (i, md5(i::text));
CREATE INDEX
postgres@pagila=# \i ~/sql/old/btree_bloat.sql-20141027
 current_database | schemaname | tblname |     idxname     | real_size | estimated_size | bloat_size |     bloat_ratio     | is_na 
------------------+------------+---------+-----------------+-----------+----------------+------------+---------------------+-------
 pagila           | public     | test3   | test3_i_md5_idx | 590536704 |      601776128 |  -11239424 | -1.9032557881448805 | f

In this version of the query, I was computing and adding the header length of varlena types (text, bytea, etc.) to the statistics (see part 3). I was wrong.

Taking the “text” type as an example, PostgreSQL adds a one-byte header to the value if it is not longer than 127 bytes, and a 4-byte one for bigger values. Looking closer at the statistic values because of this negative bloat, I realized that the headers were already included in them. As a demo, take an md5 string, 32 bytes long. In the following results, we can see the average width from pg_stats is 32+1 for one md5, and 4*32+4 for a string of 4 concatenated md5s, supposed to be 128 bytes long:

postgres@pagila=# create table test2 as
  select i, md5(i::text), repeat(md5(i::text), 4) from generate_series(1, 5) i;
SELECT 5
postgres@pagila=# analyze test2;
ANALYZE
postgres@pagila=# select tablename, attname, avg_width from pg_stats where tablename = 'test2';
 tablename | attname | avg_width 
-----------+---------+-----------
 test2     | i       |         4
 test2     | md5     |        33
 test2     | repeat  |       132

After removing this part of the query, stats for test3_i_md5_idx are much better:

postgres@pagila=# SELECT relname,
  100 - (stattuple.pgstatindex(relname)).avg_leaf_density AS bloat_ratio
  FROM pg_class WHERE relname = 'test3_i_md5_idx';
     relname     | bloat_ratio 
-----------------+-------------
 test3_i_md5_idx |       10.01
postgres@pagila=# \i ~/sql/old/btree_bloat.sql-20141028
 current_database | schemaname | tblname |     idxname     | real_size | estimated_size | bloat_size |     bloat_ratio     | is_na 
------------------+------------+---------+-----------------+-----------+----------------+------------+---------------------+-------
 pagila           | public     | test3   | test3_i_md5_idx | 590536704 |      521535488 |   69001216 | 11.6844923495221052 | f

This is a nice bug fix AND one complexity removed from the query. Code simplification is always good news :)

Adding a bit of Opaque Data

When studying the Btree layout, I had forgotten about one small non-data area in index pages: the “Special space”, aka “Opaque Data” in the source code. The previous bug took me back to this doc page, where I realized I should probably pay attention to this space.

This is a small space on each page reserved for the access method, so it can store whatever it needs for its own purpose. For instance, in the case of a Btree index, this “special space” is 16 bytes long and is used (among other things) to reference both siblings of the page in the tree. Ordinary tables have no opaque data, so no special space (good, I’ll not have to fix this bug in my table bloat estimation query).
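The arithmetic is easy to check from SQL; the constants below (24-byte page header, 16-byte btree special space) match the layout described above:

```sql
-- Space in a btree page available for line pointers and index tuples,
-- once the page header and the btree "special space" are subtracted.
SELECT current_setting('block_size')::int  -- 8192 by default
       - 24   -- page header
       - 16   -- btree special space (opaque data)
       AS usable_bytes;
```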

This small bug is not as bad for the stats as the previous ones, but fixing it definitely helps the accuracy of the bloat estimation. Using the previous demo on test3_i_md5_idx, here is a comparison of the real bloat, the estimation without considering the special space, and the estimation considering it:

postgres@pagila=# SELECT relname,
  100 - (stattuple.pgstatindex(relname)).avg_leaf_density AS bloat_ratio
  FROM pg_class WHERE relname = 'test3_i_md5_idx';
     relname     | bloat_ratio 
-----------------+-------------
 test3_i_md5_idx |       10.01
postgres@pagila=# \i ~/sql/old/btree_bloat.sql-20141028
 current_database | schemaname | tblname |     idxname     | real_size | estimated_size | bloat_size |     bloat_ratio     | is_na 
------------------+------------+---------+-----------------+-----------+----------------+------------+---------------------+-------
 pagila           | public     | test3   | test3_i_md5_idx | 590536704 |      521535488 |   69001216 | 11.6844923495221052 | f
postgres@pagila=# \i ~/sql/btree_bloat.sql
 current_database | schemaname | tblname |     idxname     | real_size | estimated_size | bloat_size |   bloat_ratio    | is_na 
------------------+------------+---------+-----------------+-----------+----------------+------------+------------------+-------
 pagila           | public     | test3   | test3_i_md5_idx | 590536704 |      525139968 |   65396736 | 11.0741187731491 | f

This makes only about a 5% difference in the estimated bloat size of this particular index.

Conclusion

I never mentioned it before, but these queries are used in check_pgactivity (a Nagios plugin for PostgreSQL), under the checks “table_bloat” and “btree_bloat”. The latest version of this tool already includes these fixes. I might write an article about check_pgactivity at some point.

As it is not really convenient for most of you to follow the updates on my gists, I keep writing here about my work on these queries. I should probably add some versioning to these queries now, and find a better way to communicate about them at some point.

As a first step, after a discussion with (one of?) the authors of pgObserver during the latest pgconf.eu, I added links to these queries to the following PostgreSQL wiki pages:

Cheers, happy monitoring, happy REINDEX-ing!

Bruce Momjian: 2015 Postgres Conferences


Due to the increased popularity of Postgres, conference organizers are more confident about future conferences and are announcing their conference dates earlier, perhaps also to attract speakers. These are the conferences already announced for 2015:

I don't remember ever seeing this many Postgres conferences scheduled this far in advance. (Most of these are listed on the Postgres Wiki Events page.) If you know of any other scheduled 2015 conferences, please post the details as a comment below.

Peter Eisentraut: Checking whitespace with Git


Whitespace matters.

Git has support for checking whitespace in patches. git apply and git am have the option --whitespace, which can be used to warn or error about whitespace errors in the patches about to be applied. git diff has the option --check to check a change for whitespace errors.

But all this assumes that your existing code is cool, and only new changes are candidates for problems. Curiously, it is a bit hard to use those same tools for going back and checking whether an existing tree satisfies the whitespace rules applied to new patches.

The core of the whitespace checking is in git diff-tree. With the --check option, you can check the whitespace in the diff between two objects.

But how do you check the whitespace of a tree rather than a diff? Basically, you want

git diff-tree --check EMPTY HEAD

except there is no EMPTY. But you can compute the hash of an empty Git tree:

git hash-object -t tree /dev/null

So the full command is

git diff-tree --check $(git hash-object -t tree /dev/null) HEAD

I have this as an alias in my ~/.gitconfig:

[alias]
    check-whitespace = !git diff-tree --check $(git hash-object -t tree /dev/null) HEAD

Then running

git check-whitespace

can be as easy as running make or git commit.
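To see the alias in action, here is a throwaway demonstration (a sketch, assuming git is installed; the file and commit are invented for illustration):

```shell
# Create a scratch repo containing a file with trailing whitespace,
# then run the same command the check-whitespace alias uses.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email you@example.com
git config user.name you
printf 'trailing space \n' > bad.txt
git add bad.txt
git commit -q -m 'add file with trailing whitespace'
# Exits non-zero and reports the offending lines when errors exist.
git diff-tree --check "$(git hash-object -t tree /dev/null)" HEAD \
    && echo "clean" || echo "whitespace errors found"
```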

Hubert 'depesz' Lubaczewski: Changes on explain.depesz.com

Uploaded new version to the server – straight from GitHub. There are two changes – one visible, and one not really. The invisible change, first, is one for people hosting explain.depesz.com on their own. As you perhaps know you can get sources of explain.depesz.com and install it on any box you want (as long as […]

Marko Tiikkaja: PostgreSQL gotcha of the week, week 45

Something happened earlier this week (or maybe it was last week, I forget), and I was quite perplexed for a good 30 seconds, so I thought I'd try and make this problem a bit more well-known.  Consider the following schema:

CREATE TABLE blacklist ( person text );
INSERT INTO blacklist VALUES ('badguy');
CREATE TABLE orders ( orderid serial PRIMARY KEY, person text, amount numeric );


Shaun M. Thomas: On PostgreSQL View Dependencies


As many seasoned DBAs might know, there’s one area that PostgreSQL still manages to be highly aggravating. By this, I mean the role views have in mucking up PostgreSQL dependencies. The part that annoys me personally, is that it doesn’t have to be this way.

Take, for example, what happens if you try to modify a VARCHAR column so that the column length is higher. We’re not changing the type, or dropping the column, or anything overly complicated. Yet we’re faced with this message:

ERROR:  cannot alter type of a column used by a view or rule
DETAIL:  rule _RETURN on view v_change_me depends on column "too_short"

Though PostgreSQL tells us which view and column prompted this error, that’s the last favor it provides. The only current way to fix this error is to drop the view, alter the column, then recreate the view. In a production 24/7 environment, this is extremely problematic. The system I work with handles over two-billion queries per day; there’s no way I’m dropping a view that the platform depends on, even in a transaction.

This problem is compounded when views depend on other views. The error doesn’t say so, but I defined another view named v_change_me_too that depends on v_change_me, yet I would never know it by the output PostgreSQL generated. Large production systems can have dozens, or even hundreds of views that depend on complex hierarchies of tables and other views. Yet there’s no built-in way to identify these views, let alone modify them safely.

If you want to follow along, this is the code I used to build my test case:

CREATE TABLE change_me ( too_short VARCHAR(30) );
CREATE VIEW v_change_me AS SELECT * FROM change_me;
CREATE VIEW v_change_me_too AS SELECT * FROM v_change_me;

And here’s the statement I used to try and make the column bigger:

ALTER TABLE change_me ALTER too_short TYPE VARCHAR(50);

It turns out we can solve this for some cases, though it takes a very convoluted path. The first thing we need to do is identify all of the views in the dependency chain. To do this, we need a recursive query. Here’s one that should find all the views in our sample chain, starting with the table itself:

WITH RECURSIVE vlist AS (
    SELECT c.oid::REGCLASS AS view_name
      FROM pg_class c
     WHERE c.relname = 'change_me'
     UNION ALL
    SELECT DISTINCT r.ev_class::REGCLASS AS view_name
      FROM pg_depend d
      JOIN pg_rewrite r ON (r.oid = d.objid)
      JOIN vlist ON (vlist.view_name = d.refobjid)
     WHERE d.refobjsubid != 0
)
SELECT * FROM vlist;

If we execute that query, both v_change_me and v_change_me_too will show up in the results. Keep in mind that in actual production systems, this list can be much longer. For systems that can survive downtime, this list can be passed to pg_dump to obtain all of the view definitions. That will allow a DBA to drop the views, modify the table, then accurately recreate them.

For simple cases where we’re just extending an existing column, we can take advantage of the fact that the pg_attribute catalog table allows direct manipulation. In PostgreSQL, the stored type modifier (atttypmod) of a VARCHAR column is 4 bytes longer than the declared column limit. So we simply reuse the recursive query and extend that length:

WITH RECURSIVE vlist AS (
    SELECT c.oid::REGCLASS AS view_name
      FROM pg_class c
     WHERE c.relname = 'change_me'
     UNION ALL
    SELECT DISTINCT r.ev_class::REGCLASS AS view_name
      FROM pg_depend d
      JOIN pg_rewrite r ON (r.oid = d.objid)
      JOIN vlist ON (vlist.view_name = d.refobjid)
     WHERE d.refobjsubid != 0
)
UPDATE pg_attribute a
   SET atttypmod = 50 + 4
  FROM vlist
 WHERE a.attrelid = vlist.view_name
   AND a.attname = 'too_short';

Now, this isn’t exactly a perfect solution. If views alias the column name, things get a lot more complicated. We have to modify the recursive query to return both the view name and the column alias. Unfortunately the pg_depend catalog always sets the objsubid column to 0 for views; the objsubid column is what would tell us which column corresponds to the aliased column.

Without having this value, it becomes impossible to know what to modify in pg_attribute for the views. In effect, instead of being a doubly-linked list, pg_depend is a singly-linked list we can only follow backwards. So we can discover what the aliases depend on, but not what the aliases are. I can’t really think of any reason this would be set for tables, but not for views.

This means, of course, that large production systems will still need to revert to the DROP -> ALTER -> CREATE route for column changes to dependent views. But why? PostgreSQL knows the entire dependency chain. Why is it impossible to modify these in an atomic transaction context? If I have one hundred views on a table, why do I have to drop all of them before modifying the table? And, again, the type of modification in this example is extremely trivial; we’re not going from a TEXT to an INT, or anything that would require drastically altering the view logic.

For highly available databases, this makes it extremely difficult to use PostgreSQL without some type of short outage. Column modifications, while not common, are a necessary evil. Since it would be silly to recommend never using views, we have to live with downtime imposed by the database software. Now that PostgreSQL is becoming popular in enterprise settings, issues like this are gaining more visibility.

Hopefully this is one of those easy fixes they can patch into 9.5 or 9.6. If not, I can see it hampering adoption.


Hubert 'depesz' Lubaczewski: How to install your own copy of explain.depesz.com

There are some cases where you might want to get your own copy of explain.depesz.com. You might not trust me with your explains. You might want to use it without internet access. Or you just want to play with it, and have total control over the site. Installing, while obvious to me, and recently described […]

Michael Paquier: pgmpc: mpd client for Postgres


Have you ever heard about mpd? It is an open source music player that works as a server-side application playing music, and is in charge of managing the database of songs and playlists. It can do far fancier things as well... Either way, mpd has a client API, libmpdclient, making it possible to control the operations of a server. This library is used by many client applications available in the wild that can interact with a remote mpd instance, the most used surely being mpc, ncmpc and gmpc. There are fancier client interfaces as well, like libmpdee.el, an elisp module for emacs. Now, PostgreSQL has always lacked a dedicated client interface for mpd, and that's where pgmpc fills the need (is there one, btw?), by providing a set of SQL functions able to interact with an mpd instance, so you can control your music player directly from Postgres.

In order to compile it, be sure to have libmpdclient installed on your system. Note as well that pgmpc is shaped as an extension, so once its sources are installed it needs to be enabled on a Postgres server using CREATE EXTENSION.

Once installed, the following list of functions, whose names are inspired by the existing interface of mpc, is available.

=# \dx+ pgmpc
   Objects in extension "pgmpc"
        Object Description
----------------------------------
 function mpd_add(text)
 function mpd_clear()
 function mpd_consume()
 function mpd_load(text)
 function mpd_ls()
 function mpd_ls(text)
 function mpd_lsplaylists()
 function mpd_next()
 function mpd_pause()
 function mpd_play()
 function mpd_playlist()
 function mpd_playlist(text)
 function mpd_prev()
 function mpd_random()
 function mpd_repeat()
 function mpd_rm(text)
 function mpd_save(text)
 function mpd_set_volume(integer)
 function mpd_single()
 function mpd_status()
 function mpd_update()
 function mpd_update(text)
 (22 rows)

Currently, what can be done is to control the player and the playlists, and to get back the status of the player. So the interface is sufficient for basic operations with mpd: enough to control it while remaining connected to your favorite database.
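A sketch of what a session might look like, inferred from the function names above (the exact behavior of each call is assumed, not documented here):

```sql
-- Hypothetical session: point pgmpc at an mpd instance and control it.
CREATE EXTENSION pgmpc;
SET pgmpc.mpd_host = 'localhost';
SET pgmpc.mpd_port = 6600;
SELECT mpd_play();          -- start playback
SELECT mpd_set_volume(50);  -- volume as an integer percentage
SELECT mpd_next();          -- skip to the next song
```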

Also, the connection to the instance of mpd can be controlled with the following GUC parameters that can be changed by the user within a single session:

  • pgmpc.mpd_host, address of the mpd instance to connect to. Default is "localhost". This can be a local Unix socket as well.
  • pgmpc.mpd_port, port of the mpd instance to connect to. Default is 6600.
  • pgmpc.mpd_password, password used to connect to the mpd instance. It is optional, and of course not recommended to write it in plain text in postgresql.conf.
  • pgmpc.mpd_timeout, timeout for establishing the connection. Default is 10s.

In any case, the code of this module is available in pg_plugins on GitHub, so feel free to send pull requests or comments about this module there. Patches to complete the existing set of functions are welcome as well.

Josh Berkus: We need a webapp benchmark

I've been doing some comparative testing of different cloud platforms' hosting of PostgreSQL.  One of the deficiencies in this effort is that the only benchmarking tool I have is pgbench, which doesn't reflect the kinds of workloads people would want to run on cloud hosting.  Don't get me wrong, pgbench does everything you could imagine with the simple Wisconsin benchmark, including statistics and sampling.  But the core benchmark still doesn't look much like the kind of Rails and Django apps I deal with on a daily basis.

There's also TPCC-js, which is more sophisticated, but is ultimately still a transactional, back-office OLTP benchmark. 

So I'm thinking of developing a "webapp" benchmark.  Here's what I see as concepts for such a benchmark:
  • Read-mostly
  • No multi-statement transactions
  • Defined "users" concept with logins and new user registration
  • Needs a "sessions" table which is frequently updated
  • Read-write, read-only and session database connections should be separable, in order to test load-balancing optimization.
  • Queries counting, sorting and modifying content
  • Measured unit of work is the "user session" which would contain some content lookups and minor updates ("likes").
Now, one of the big questions is whether we should base this benchmark on the idea of a social networking (SN) site.  I think we should; SN sites test a number of things, including locking and joins which might not be exercised by other types of applications (and aren't by pgbench).  What do you think?  Does anyone other than me want to work on this?
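To make the concepts above concrete, here is one possible minimal schema sketch for such a benchmark (all table and column names are invented for illustration, not part of any existing tool):

```sql
-- Hypothetical minimal schema for the proposed webapp benchmark.
CREATE TABLE users (
    user_id    serial PRIMARY KEY,
    login      text NOT NULL UNIQUE,          -- logins and new user registration
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Frequently updated, as described in the concept list.
CREATE TABLE sessions (
    session_id text PRIMARY KEY,
    user_id    int REFERENCES users,
    last_seen  timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE content (
    content_id serial PRIMARY KEY,
    author_id  int REFERENCES users,
    body       text,
    likes      int NOT NULL DEFAULT 0         -- minor updates per user session
);
```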

gabrielle roth: PDXPUG: November meeting in two weeks


When: 6-8pm Thu Nov 20, 2014
Where: Iovation
What: 9.4 party!

As discussed at last month’s meeting, we’re going to check out some of the new 9.4 features. It will help if you already have 9.4 installed on your laptop, but if you’re new & don’t know how to do that, just show up & we’ll help you out.

If you’re a Vagrant user, try this: https://github.com/softwaredoug/vagrant-postgres-9.4

Our meeting will be held at Iovation, on the 32nd floor of the US Bancorp Tower at 111 SW 5th (5th & Oak). It’s right on the Green & Yellow Max lines. Underground bike parking is available in the parking garage; outdoors all around the block in the usual spots. No bikes in the office, sorry!

Elevators open at 5:45 and building security closes access to the floor at 6:30.


See you there!


Jim Mlodgenski: Synchronous Commit


While I was at PGConf.EU a couple of weeks ago in Madrid, I attended a talk by Grant McAlister discussing Amazon RDS for PostgreSQL.  While it was interesting to see how Amazon had made it very simple for developers to get a production PostgreSQL instance quickly, the thing that really caught my eye was the performance benchmarks comparing the fsync and synchronous commit parameters.

Frighteningly, it is not that uncommon for people to turn off fsync to get a performance gain out of their PostgreSQL database. While the performance gain is dramatic, it carries the risk that your database could become corrupt. In some cases this may be OK, but those cases are really rather rare. A more common case is a database where it is OK to lose a little data in the event of a crash. This is where synchronous commit comes in. When synchronous commit is off, the server returns success to the client immediately, but waits a short period of time before flushing the data to disk. When the data is ultimately flushed, it is still properly synced to disk, so there is no chance of data corruption. The only risk in the event of a crash is that you may lose some recent transactions. The default setting for this window is 200ms.
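Unlike fsync, synchronous_commit can be changed without a restart, and even per session or per transaction, so only the transactions that can afford to lose a commit need to opt in. A minimal sketch (the audit_log table is a hypothetical example):

```sql
-- Per-transaction opt-out: this commit returns before the WAL flush,
-- while every other transaction keeps fully durable commits.
BEGIN;
SET LOCAL synchronous_commit TO off;
INSERT INTO audit_log (message) VALUES ('low-value event');
COMMIT;
```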

In Grant’s talk, he showed a benchmark where turning off synchronous commit gave a bigger performance gain than turning off fsync. He ran an insert-only test, so I wanted to try a standard pgbench test. I didn’t come up with the same results, but I still saw a compelling case for leaving fsync on while turning off synchronous commit.

I ran a pgbench test with 4 clients and a scaling factor of 100 on a small EC2 instance running 9.3.5. Turning off fsync resulted in a 150% performance gain; turning off synchronous commit resulted in a 128% performance gain. Both are dramatic gains, but the synchronous commit option carries a lot less risk.

Speaking of conferences, the call for papers is open for PGConf US 2015. If there is a topic you’d like to present in New York in March, submit it here.

Feng Tian: TPCH on PostgreSQL (part 3)

Today is the biggest internet shopping day.  If you read Chinese, you may find that Alibaba posts job openings for PostgreSQL DBAs from time to time.  It would be interesting to find out how many transactions and/or analytic workloads inside Alibaba are handled by PostgreSQL.

Back to TPCH.  Q16 is particularly tough for us to optimize.  Again, it is better to look at a simpler example,

ftian=# create table t as select x, x % 10 as i, x % 100 as j, 'random string' || (x % 100) as s from generate_series(1, 1000000) x;
SELECT 1000000
Time: 784.034 ms

ftian=# select count(distinct i) from t group by s, j;
Time: 7991.994 ms

Grouping one million rows in 8 sec, kind of slow.  So let's try, 

ftian=# select count(i) from t group by s, j;
Time: 99.029 ms

So it must be count(distinct).  Well, we wasted quite some time chasing the distinct; we should have profiled before optimizing.  The real reason is that count(distinct) triggers a sort agg instead of a hash agg.  Distinct needs extra per-group state in the aggregate function, and it is very hard to cap that memory consumption in a hash agg.  So a sort agg is used, which actually is a fine plan.

ftian=# explain select count(distinct i) from t group by s, j;
                               QUERY PLAN                               
------------------------------------------------------------------------
 GroupAggregate  (cost=117010.84..127110.84 rows=10000 width=23)
   ->  Sort  (cost=117010.84..119510.84 rows=1000000 width=23)
         Sort Key: s, j
         ->  Seq Scan on t  (cost=0.00..17353.00 rows=1000000 width=23)
(4 rows)

Is sort that slow?

ftian=# select count(distinct i) from t group by j, s;
Time: 1418.938 ms

5x faster.  What is going on?  Note that we switched the grouping order from group by s, j to group by j, s.  These two queries are equivalent, except for the order of the results (well, if you really care about ordering, you should have an order by clause).

The true cost lies in string comparison with collation.  The database uses utf-8 encoding, and sorting on text will use string comparison with collation.  When grouping by (s, j), we always compare the two strings.  When grouping by (j, s), the string comparison is only executed when the j values are equal.
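One workaround not explored in the original text: when the linguistic ordering of the groups does not matter, the collation overhead can be avoided by forcing byte-wise comparison on the text column (a sketch; timings will vary):

```sql
-- COLLATE "C" compares bytes instead of applying the utf-8 collation;
-- the group contents are unchanged, only the (unspecified) output
-- order of the groups may differ.
SELECT count(DISTINCT i) FROM t GROUP BY s COLLATE "C", j;
```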

Finally,  to see how expensive compare with collation is, 

ftian=# select count(distinct i) from t group by decode(s, 'escape'), j;
Time: 2831.271 ms

Even though we go through the not-so-cheap decode hoop, it is still almost 3x faster.

So what can we do in this situation?
  • If your group by clause is group by t, i, consider rewriting your query as group by i, t.  Basically, put faster data types (int, float, etc.) with many distinct values first.
  • Wait for a patch from Peter Geoghegan.  If the strings have random prefixes, the patch will speed up string comparison considerably.  But it won't help this example, because we have a common prefix.  The TPCH data has the same common-prefix problem.  On the other hand, these are synthetic data, and Peter's patch should help a lot in many situations.
A final remark.  Can we get rid of the sort completely?  Yes, if there is only one distinct:

ftian=# explain select count(i) from (select i, j, s from t group by i, j, s) tmpt group by s, j;
                               QUERY PLAN                               
------------------------------------------------------------------------
 HashAggregate  (cost=27603.00..27703.00 rows=10000 width=23)
   ->  HashAggregate  (cost=24853.00..25853.00 rows=100000 width=23)
         ->  Seq Scan on t  (cost=0.00..17353.00 rows=1000000 width=23)
(3 rows)

ftian=# select count(i) from (select i, j, s from t group by i, j, s) tmpt group by s, j;
Time: 106.915 ms

This is how fast the query can be, even though we have two HashAggregate nodes.  I wish that one day the PostgreSQL planner could do this trick for me.

P.S We did not rewrite Q16 in our benchmark after all.  It is an interesting case that we want to keep an eye on.


Bruce Momjian: Postgres Rising in Russia


I just returned from two weeks in Russia, and I am happy to report that Postgres is experiencing strong growth there. I have regularly complained that Russian Postgres adoption was lagging, but the sanctions have tipped the scales and moved Russia into Postgres-hyper-adoption mode. New activities include:

As part of my visit I spoke at EnterpriseDB partner LANIT. The hour-long presentation was recorded and covers:

  • The Postgres community development process
  • Comparison of Postgres to proprietary databases
  • The future direction of Postgres

Brian Dunavant: Writeable CTEs to improve performance


I wrote an article on my company blog on using Postgres’ writable CTE feature to improve performance and write cleaner code. The article is available at:

http://omniti.com/seeds/writable-ctes-improve-performance

I’ve been told by a number of people I should expand it further to include updates and the implications there and then consider doing a talk on it at a Postgres conference.   Hrm….

Michael Paquier: Postgres 9.5 feature highlight: BRIN indexes


A new index type called BRIN, or Block Range INdex, is showing up in PostgreSQL 9.5, introduced by this commit:

commit: 7516f5259411c02ae89e49084452dc342aadb2ae
author: Alvaro Herrera <alvherre@alvh.no-ip.org>
date: Fri, 7 Nov 2014 16:38:14 -0300
BRIN: Block Range Indexes

BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes.  They work by maintaining "summary" data about
block ranges.  Bitmap index scans work by reading each summary tuple and
comparing them with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not.  Normal index scans are not supported
because these indexes do not store TIDs.

By nature, using a BRIN index for a query scan is a mix between a sequential scan and an index scan, because such an index stores a range of values for each fixed number of data blocks. So this type of index shines on very large relations that cannot sustain the size of, for example, a btree where all values are indexed, and it works even better when the data is highly ordered across the relation's blocks. For example, let's take the case of a simple table where the data is completely ordered across data pages, like this one with 100 million tuples:

=# CREATE TABLE brin_example AS SELECT generate_series(1,100000000) AS id;
SELECT 100000000
=# CREATE INDEX btree_index ON brin_example(id);
CREATE INDEX
Time: 239033.974 ms
=# CREATE INDEX brin_index ON brin_example USING brin(id);
CREATE INDEX
Time: 42538.188 ms
=# \d brin_example
Table "public.brin_example"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer |
Indexes:
    "brin_index" brin (id)
    "btree_index" btree (id)

Note that the creation of the BRIN index was much faster: it has fewer index entries to write, so it generates less write traffic. By default, 128 blocks are used to compute the range of values for a single index entry; this can be changed with the new storage parameter pages_per_range using a WITH clause.

=# CREATE INDEX brin_index_64 ON brin_example USING brin(id)
    WITH (pages_per_range = 64);
CREATE INDEX
=# CREATE INDEX brin_index_256 ON brin_example USING brin(id)
   WITH (pages_per_range = 256);
CREATE INDEX
=# CREATE INDEX brin_index_512 ON brin_example USING brin(id)
   WITH (pages_per_range = 512);
CREATE INDEX

Having a look at the relation sizes, BRIN indexes are much smaller:

=# SELECT relname, pg_size_pretty(pg_relation_size(oid))
    FROM pg_class WHERE relname LIKE 'brin_%' OR
         relname = 'btree_index' ORDER BY relname;
    relname     | pg_size_pretty
----------------+----------------
 brin_example   | 3457 MB
 brin_index     | 104 kB
 brin_index_256 | 64 kB
 brin_index_512 | 40 kB
 brin_index_64  | 192 kB
 btree_index    | 2142 MB
(6 rows)

Let's have a look at what kind of plan is generated then for scans using the btree index and the BRIN index on the previous table.

=# EXPLAIN ANALYZE SELECT id FROM brin_example WHERE id = 52342323;
                                      QUERY PLAN
---------------------------------------------------------------------------------
Index Only Scan using btree_index on brin_example
      (cost=0.57..8.59 rows=1 width=4) (actual time=0.031..0.033 rows=1 loops=1)
   Index Cond: (id = 52342323)
   Heap Fetches: 1
 Planning time: 0.200 ms
 Execution time: 0.081 ms
(5 rows)
=# EXPLAIN ANALYZE SELECT id FROM brin_example WHERE id = 52342323;
                                       QUERY PLAN
--------------------------------------------------------------------------------------
 Bitmap Heap Scan on brin_example
       (cost=20.01..24.02 rows=1 width=4) (actual time=11.834..30.960 rows=1 loops=1)
   Recheck Cond: (id = 52342323)
   Rows Removed by Index Recheck: 115711
   Heap Blocks: lossy=512
   ->  Bitmap Index Scan on brin_index_512
       (cost=0.00..20.01 rows=1 width=0) (actual time=1.024..1.024 rows=5120 loops=1)
          Index Cond: (id = 52342323)
 Planning time: 0.196 ms
 Execution time: 31.012 ms
(8 rows)

The btree index is of course faster; in this case an index only scan is even doable. Now remember that BRIN indexes are lossy, meaning that not all the blocks fetched after scanning the range entries necessarily contain a target tuple.
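Since BRIN depends on the physical ordering of the data, a quick way to judge whether a column is a good candidate is to look at its correlation statistic. A hedged sketch, reusing the brin_example table from above (run ANALYZE first so the statistics exist):

```sql
-- pg_stats.correlation measures how well a column's values follow the
-- physical row order: 1.0 means perfectly ordered, 0 means random.
-- BRIN works best when this is close to 1 or -1.
ANALYZE brin_example;
SELECT tablename, attname, correlation
  FROM pg_stats
 WHERE tablename = 'brin_example' AND attname = 'id';
```

For the table built with generate_series above, the correlation should be at or near 1.0; on a column with near-zero correlation, a BRIN scan would end up fetching most of the table's blocks anyway.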

A last thing to notice is that pageinspect has been updated with a set of functions to scan pages of a BRIN index:

=# SELECT itemoffset, value
   FROM brin_page_items(get_raw_page('brin_index', 5), 'brin_index') LIMIT 5;
 itemoffset |         value
------------+------------------------
          1 | {35407873 .. 35436800}
          2 | {35436801 .. 35465728}
          3 | {35465729 .. 35494656}
          4 | {35494657 .. 35523584}
          5 | {35523585 .. 35552512}
(5 rows)

In this first version, BRIN indexes come with a set of operator classes able to perform min/max calculations per set of pages for most of the common datatypes. The list is available here. Note that the design of BRIN indexes makes it possible to implement new operator classes with operations more complex than simple min/max; one of the next operator classes that may show up would be for point and bounding-box calculations.

Joshua Drake: AWS performance: Results included

I am not a big fan of AWS. It is a closed platform. It is designed to be the Apple of the Cloud to the Eve of Postgres users. That said, customers drive business and some of our customers use AWS, even if begrudgingly. Because of these factors we are getting very good at getting PostgreSQL to perform on AWS/EBS, albeit with some disclosures:
  1. That high IO latency is an acceptable business requirement.
  2. That you are willing to spend a lot of money to get performance you can get for less money using bare metal: rented or not. Note: This is a cloud issue not an AWS issue.

Using the following base configuration (see adjustments for each configuration after the graphic):

port = 5432                             
max_connections = 500                   
ssl = true                              
shared_buffers = 4GB                    
temp_buffers = 8MB                      
work_mem = 47MB                         
maintenance_work_mem = 512MB            
wal_level = hot_standby                 
synchronous_commit = on         
commit_delay = 0                        
commit_siblings = 5                     
checkpoint_segments = 30               
checkpoint_timeout = 10min              
checkpoint_completion_target = 0.9      
random_page_cost = 1.0                  
effective_cache_size = 26GB

Each test was run using pgbench against 9.1, except for configuration 9, which used 9.3:

pgbench -F 100 -s 100 postgres -c 500 -j10 -t1000 -p5433

Here are some of our latest findings:

The AWS configuration is:

16 Cores
30G of memory (free -h reports 29G)
(2) PIOPS volumes at 2000 IOPS a piece.
The PIOPS volumes are not in a RAID and are mounted separately.
The PIOPS volumes are formatted with xfs and default options
The PIOPS volumes were warmed.
  1. Configuration 1:
    $PGDATA and pg_xlog on the same partition
    synchronous_commit = on
  2. Configuration 2:
    $PGDATA and pg_xlog on the same partition
    synchronous_commit = off
  3. Configuration 3:
    $PGDATA and pg_xlog on the same partition
    synchronous_commit = off
    commit_delay = 100000
    commit_siblings = 50
  4. Configuration 4:
    $PGDATA and pg_xlog on the same partition
    synchronous_commit = off
    commit_delay = 100000
    commit_siblings = 500
  5. Configuration 5:
    $PGDATA and pg_xlog on different partitions
    synchronous_commit = off
    commit_delay = 100000
    commit_siblings = 500
  6. Configuration 6:
    $PGDATA and pg_xlog on different partitions
    synchronous_commit = on
    commit_delay = 100000
    commit_siblings = 500
  7. Configuration 7:
    $PGDATA and pg_xlog on different partitions
    synchronous_commit = on
    commit_delay = 0
    commit_siblings = 5
  8. Configuration 8:
    $PGDATA and pg_xlog on different partitions
    synchronous_commit = on
    checkpoint_segments = 300
    checkpoint_timeout = 60min
  9. Configuration 9:
    $PGDATA and pg_xlog on different partitions
    PostgreSQL 9.3
    synchronous_commit = on
    checkpoint_segments = 300
    checkpoint_timeout = 60min

Joshua Tolley: Dear PostgreSQL: Where are my logs?


From Flickr user Jitze Couperus

When debugging a problem, it's always frustrating to get sidetracked hunting down the relevant logs. PostgreSQL users can select any of several different ways to handle database logs, or even choose a combination. But especially for new users, or those getting used to an unfamiliar system, just finding the logs can be difficult. To ease that pain, here's a key to help dig up the correct logs.

Where are log entries sent?

First, connect to PostgreSQL with psql, pgadmin, or some other client that lets you run SQL queries, and run this:
foo=# show log_destination ;
 log_destination 
-----------------
 stderr
(1 row)
The log_destination setting tells PostgreSQL where log entries should go. In most cases it will be one of four values, though it can also be a comma-separated list of any of those four values. We'll discuss each in turn.

SYSLOG

Syslog is a complex beast, and if your logs are going here, you'll want more than this blog post to help you. Different systems have different syslog daemons; those daemons have different capabilities and require different configurations, and we simply can't cover them all here. Your syslog may be configured to send PostgreSQL logs anywhere on the system, or even to an external server. For your purposes, though, you'll need to know what "ident" and "facility" you're using. These values tag each syslog message coming from PostgreSQL, and allow the syslog daemon to sort out where the message should go. You can find them like this:
foo=# show syslog_facility ;
 syslog_facility 
-----------------
 local0
(1 row)

foo=# show syslog_ident ;
 syslog_ident 
--------------
 postgres
(1 row)
Syslog is often useful, in that it allows administrators to collect logs from many applications into one place, to relieve the database server of logging I/O overhead (which may or may not actually help anything), or any number of other interesting rearrangements of log data.
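As a hedged illustration, given the local0 facility shown above, a single rule in a syslog configuration file can route PostgreSQL's messages to a dedicated file. The file path and destination below are assumptions for an rsyslog setup; adapt them to your daemon:

```
# Example: /etc/rsyslog.d/30-postgresql.conf (path is an assumption)
# Send everything PostgreSQL logs to the local0 facility
# into its own file; destination path is also an assumption.
local0.*    /var/log/postgresql/postgresql.log
```

After adding a rule like this, reload the syslog daemon; PostgreSQL messages tagged with the configured syslog_ident should then appear in that file.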

EVENTLOG

For PostgreSQL systems running on Windows, you can send log entries to the Windows event log. You'll want to tell Windows to expect the log values, and what "event source" they'll come from. You can find instructions for this operation in the PostgreSQL documentation discussing server setup.

STDERR

This is probably the most common log destination (it's the default, after all) and can get fairly complicated in itself. Selecting "stderr" instructs PostgreSQL to send log data to the "stderr" (short for "standard error") output pipe most operating systems give every new process by default. The difficulty is that PostgreSQL, or the applications that launch it, can then redirect this pipe to all kinds of different places. If you start PostgreSQL manually with no particular redirection in place, log entries will be written to your terminal:
[josh@eddie ~]$ pg_ctl -D $PGDATA start
server starting
[josh@eddie ~]$ LOG:  database system was shut down at 2014-11-05 12:48:40 MST
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
LOG:  statement: select syntax error;
ERROR:  column "syntax" does not exist at character 8
STATEMENT:  select syntax error;
In these logs you'll see me starting the database, connecting to it from another terminal, and issuing the obviously erroneous command "select syntax error". But there are several ways to redirect this output elsewhere. The easiest is pg_ctl's -l option, which essentially redirects stderr to a file, in which case the startup looks like this:
[josh@eddie ~]$ pg_ctl -l logfile -D $PGDATA start
server starting
Finally, you can also tell PostgreSQL to redirect its stderr output internally, with the logging_collector option (which older versions of PostgreSQL named "redirect_stderr"). This can be on or off, and when on, collects stderr output into a configured log directory.

So if you see log_destination set to "stderr", a good next step is to check logging_collector:
foo=# show logging_collector ;
 logging_collector 
-------------------
 on
(1 row)
In this system, logging_collector is turned on, which means we have to find out where it's collecting logs. First, check log_directory. In my case, below, it's an absolute path, but by default it's the relative path "pg_log". This is relative to the PostgreSQL data directory. Log files are named according to a pattern in log_filename. Each of these settings is shown below:
foo=# show log_directory ;
      log_directory      
-------------------------
 /home/josh/devel/pg_log
(1 row)

foo=# show data_directory ;
       data_directory       
----------------------------
 /home/josh/devel/pgdb/data
(1 row)

foo=# show log_filename ;
          log_filename          
--------------------------------
 postgresql-%Y-%m-%d_%H%M%S.log
(1 row)
Documentation for each of these options, along with settings governing log rotation, is available here.

If logging_collector is turned off, you can still find the logs using the /proc filesystem, on operating systems equipped with one. First you'll need to find the process ID of a PostgreSQL process, which is simple enough:
foo=# select pg_backend_pid() ;
 pg_backend_pid 
----------------
          31950
(1 row)
Then, check /proc/YOUR_PID_HERE/fd/2, which is a symlink to the log destination:
[josh@eddie ~]$ ll /proc/31113/fd/2
lrwx------ 1 josh josh 64 Nov  5 12:52 /proc/31113/fd/2 -> /var/log/postgresql/postgresql-9.2-local.log

CSVLOG

The "csvlog" mode creates logs in CSV format, designed to be easily machine-readable. In fact, this section of the PostgreSQL documentation even provides a handy table definition if you want to slurp the logs into your database. CSV logs are produced in a fixed format the administrator cannot change, but it includes fields for everything available in the other log formats. For these to work, you need to have logging_collector turned on; without logging_collector, the logs simply won't show up anywhere. But when configured correctly, PostgreSQL will create CSV format logs in the log_directory, with file names mostly following the log_filename pattern. Here's my example database, with log_destination set to "stderr, csvlog" and logging_collector turned on, just after I start the database and issue one query:
[josh@eddie ~/devel/pg_log]$ ll
total 8
-rw------- 1 josh josh 611 Nov 12 16:30 postgresql-2014-11-12_162821.csv
-rw------- 1 josh josh 192 Nov 12 16:30 postgresql-2014-11-12_162821.log
The CSV log output looks like this:
[josh@eddie ~/devel/pg_log]$ cat postgresql-2014-11-12_162821.csv 
2014-11-12 16:28:21.700 MST,,,2993,,5463ed15.bb1,1,,2014-11-12 16:28:21 MST,,0,LOG,00000,"database system was shut down at 2014-11-12 16:28:16 MST",,,,,,,,,""
2014-11-12 16:28:21.758 MST,,,2991,,5463ed15.baf,1,,2014-11-12 16:28:21 MST,,0,LOG,00000,"database system is ready to accept connections",,,,,,,,,""
2014-11-12 16:28:21.759 MST,,,2997,,5463ed15.bb5,1,,2014-11-12 16:28:21 MST,,0,LOG,00000,"autovacuum launcher started",,,,,,,,,""
2014-11-12 16:30:46.591 MST,"josh","josh",3065,"[local]",5463eda6.bf9,1,"idle",2014-11-12 16:30:46 MST,2/10,0,LOG,00000,"statement: select 'hello, world!';",,,,,,,,,"psql"
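Because the format is fixed, these files can be loaded straight back into a database for analysis. A hedged sketch, assuming you have first created the postgres_log table exactly as defined in the PostgreSQL documentation's CSV-logging section (the file path below is from the example above):

```sql
-- Load a CSV log file into the postgres_log table described in the
-- PostgreSQL docs; COPY in csv mode handles the quoting for you.
COPY postgres_log
  FROM '/home/josh/devel/pg_log/postgresql-2014-11-12_162821.csv'
  WITH csv;

-- Then query it like any other table, e.g. all ERROR entries:
SELECT log_time, message FROM postgres_log
 WHERE error_severity = 'ERROR';
```

Note that COPY FROM a server-side file requires superuser privileges (or \copy from psql for a client-side file).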

Gabriele Bartolini: Italian PGDay, eighth edition: over 120 attendees!


Group photo from PGDayIT

November 7th 2014 was the eighth Italian PostgreSQL Day, the national event dedicated to the promotion of the world's most advanced open source database. The Italian edition is one of the most enduring in the whole Postgres community (the first one took place in July 2007) and is the result of the activity of a well-established non-profit organisation, ITPUG (Italian PostgreSQL Users Group).

The Italian PGDay took place in Prato, the historical location for this event, on the premises of the Prato campus (PIN) of the University of Florence. For the first time, attendance went over 100 people, with a final count of 124 registered attendees (including speakers and staff). I was also extremely happy to notice a significant presence of women at PGDay, at around 10% I believe.

It was a pleasure to have international speakers like Magnus and Simon, in Prato for the nth time, as well as a new entry, Mladen Marinovic from Croatia. There were 14 talks in total, spread across two parallel sessions, plus an interactive training session (ITPUG labs) in the second room.

I was delighted to deliver the opening keynote, a summary of my experience and relationship with PostgreSQL at both a community and a professional level. It focused on us, knowledge workers, who can decide to invest in open source for our continuous improvement. And what better than studying (as well as teaching in schools) software like Linux and PostgreSQL? I then quickly outlined the most common objections to the adoption of PostgreSQL (including the funniest and most depressing ones) that I have encountered so far in my career (e.g.: "I just want to know: can Postgres manage millions of records?"), and then unrolled the reasons why I believe adopting PostgreSQL now is the wisest and most strategic decision that can be made for a data management solution.

I also want to thank some important Italian companies that decided to come out and publicly say why Postgres is the right choice for the daily management of their data, their main asset: Navionics, JobRapido and Subito.it (thank you Laura, Paolo and Pietro).

On a final note, I want to thank all the volunteers and fellow ITPUG members that made this event possible. Being one of the founders of ITPUG and a former president, I am really proud to see the progress that the association has been making through the dedication and hard work of all the volunteers that donate their spare time to the promotion of Postgres in Italy. Thank you guys.

Here is the coverage on Twitter (#PGDayIT2014).

 
