
Shaun M. Thomas: A Short Examination of pg_shard


For part of today, I’ve been experimenting with the new-ish pg_shard extension contributed by CitusData. I had pretty high hopes for this module and was extremely excited to try it out. After screwing around with it for a while, I can say it has a lot of potential. Yet I can’t reasonably recommend it in its current form. The README file suggests quite a few understandable caveats, but it’s the ones they don’t mention that hurt a lot more.

Here’s what I encountered while experimenting:

  • No support for transactions.
  • No support for the EXPLAIN command to produce query plans.
  • No support for COPY for bulk loading.
  • A bug that causes worker nodes to reject use of CURRENT_DATE in queries.

The first two are probably the worst, and the third is hardly trivial. I’m pretty long-winded, but here’s my view on potential impact.

By itself, lacking transaction support makes pg_shard more of a toy in my opinion. This breaks the A in ACID, and as such, reduces PostgreSQL from a legitimate database, to a fun experiment in assuming bad data never makes it into a sharded table. I would never, in good conscience, deploy such a thing into a production environment.

By not providing EXPLAIN support, it is not possible to see what a query might do on a sharded cluster. This is not only incredibly dangerous, but makes it impossible to troubleshoot or optimize queries. Which shards would the query run on? How much data came from each candidate shard? There’s no way to know. It is possible to load the auto_explain module on each worker node to examine what it did, but there’s no way to check the query plan beforehand.

And what about COPY? The documentation states that INSERT is the only way to get data into a table. Outside of a transaction, multiple inserts are incredibly slow due to round trip time, single-transaction context, fsync delays, and the list goes on. I created a VM and threw a measly 100k individual inserts at a regular, unsharded table, and the whole job took over a minute. Replaying the script in a transaction cut that time down to ten seconds. On the pg_shard copy of the table with two worker nodes, the same inserts required two minutes and twenty seconds. For 100k records. Presumably this could be corrected by utilizing several loader threads in parallel, but I haven’t tested that yet.
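
For a sense of what I was comparing, the test boiled down to something like this (the table definition and file path are simplified placeholders, not my exact script):

-- A plain four-column table, similar in spirit to my test table.
CREATE TABLE insert_test (
    id      BIGINT,
    created DATE,
    label   TEXT,
    amount  NUMERIC
);

-- Slow: 100k of these, each running in its own implicit transaction.
INSERT INTO insert_test VALUES (1, '2015-03-01', 'row 1', 1.0);

-- Faster: the same statements wrapped in a single transaction.
BEGIN;
INSERT INTO insert_test VALUES (1, '2015-03-01', 'row 1', 1.0);
-- ... 99,999 more ...
COMMIT;

-- Fastest by far on a regular table, but unavailable for pg_shard tables:
COPY insert_test FROM '/tmp/insert_test.csv' WITH (FORMAT csv);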

The primary reason sharding might be used is to horizontally scale a very large table. Based on the times I saw, it would take 15 days to load a table with one billion rows. The sample table I used had only four columns and no indexes to slow the loading process. Yet the COPY statement needed only 300ms for the same amount of data, and could load one billion rows of that table in under an hour. So even if I ignored the lack of transactions and EXPLAIN support, getting our 20 billion rows of data into pg_shard simply wouldn’t be feasible.

I really, really wanted to consider pg_shard on one of the large multi-TB instances I administer. I still do. But for now, I’m going to watch the project, check in on it occasionally, and see if they eventually work out these kinks. It’s a great prototype, and having CitusData behind it suggests it’ll eventually become something spectacular.

Of course, there’s always the possibility that an as yet unnamed project somewhere out there already is. If so, please point me to it; pg_shard teased my salivary glands, and I want more.


gabrielle roth: PDXPUG: March meeting next week


When: 6-8pm Thursday Mar 19, 2015
Where: Iovation
Who: Ed Snajder
What: Creating an auto-partition strategy

Table partitioning is a great way to reduce cost and IO on very large data sets, very often with analytics-type databases and systems that collect historical data. While Postgres does offer table partitioning and the advantages that go with it, the care and feeding of partitioned systems can be tedious and error-prone, and Postgres does not offer a lot of built-in tools to help to reduce administrative overhead.

Ed will share his approach to tackling this, with a set of functions that will take an existing table, create partitions on it, and ultimately migrate the data. He’ll also, time permitting, compare to the more mature pg_partman.

Our meeting will be held at Iovation, on the 32nd floor of the US Bancorp Tower at 111 SW 5th (5th & Oak). It’s right on the Green & Yellow Max lines. Underground bike parking is available in the parking garage; outdoors all around the block in the usual spots. No bikes in the office, sorry!

Elevators open at 5:45 and building security closes access to the floor at 6:30.

See you there!


Shaun M. Thomas: PG Phriday: Interacting with JSON and JSONB


With the release of PostgreSQL 9.4 comes the ability to use binary JSON objects. This internal representation is faster and more capable than the original JSON type included in 9.3. But how do we actually interact with JSON and JSONB in a database connection context? The answer is a little complicated and somewhat surprising.

Casting. Casting Everywhere.

Despite its inclusion as an internal type, PostgreSQL maintains its position of encouraging explicit casting to avoid bugs inherent in magic type conversions. Unfortunately, JSON blurs several lines in this regard, and this can lead to confusion on several fronts.

Let’s take a look at JSON first. Here are three very basic JSON documents for illustration:

{ "name": "cow-man" }
{ "weight": 389.4 }
{ "alive": true }

Nothing crazy. We have a string, a number, and a boolean. The PostgreSQL JSON type documentation suggests it handles these internally, which we can see for ourselves.

SELECT '{ "name": "cow-man" }'::JSON;

         json          
-----------------------
 { "name": "cow-man" }

SELECT '{ "weight": 389.4 }'::JSON;

        json         
---------------------
 { "weight": 389.4 }

SELECT '{ "alive": true }'::JSON;

       json        
-------------------
 { "alive": true }

Great! We can see the string, the number, and the boolean preserved in PostgreSQL’s encoding. Things start to go a bit sideways when we pull fields, though:

SELECT '{ "name": "cow-man" }'::JSON->'name';

 ?column?  
-----------
 "cow-man"

So far, so good. The PostgreSQL JSON documentation for functions and operators says that the -> operator returns a JSON object. And indeed, we can re-cast this string to JSON:

SELECT '"cow-man"'::JSON;

   json    
-----------
 "cow-man"

What happens when we try to compare two JSON objects, though?

SELECT '{ "name": "cow-man" }'::JSON->'name' = '"cow-man"'::JSON;

ERROR:  operator does not exist: json = json

Wait… what? Hmm. Let’s try the same thing with JSONB:

SELECT '{ "name": "cow-man" }'::JSONB->'name' = '"cow-man"'::JSONB;

 ?column? 
----------
 t

That’s something of a surprise, isn’t it? It’s pretty clear from this that JSON and JSONB are much more than simply how the data gets encoded and stored. It also drastically affects how it’s possible to interact with the data itself.

Don’t relax yet, though! JSON and JSONB casting only succeeds from TEXT, VARCHAR, and similar types. For example, these don’t work:

SELECT 365::JSON;
SELECT 365::JSONB;

But these do:

SELECT 365::TEXT::JSON;
SELECT 365::TEXT::JSONB;

So even though PostgreSQL acknowledges JSON datatypes, it can’t convert between those and its own internal types. A PostgreSQL NUMERIC is similar to a JSON NUMBER, but they’re not interchangeable, and can’t even be cast without first going through some kind of TEXT type. The same is true for boolean values. The only type that is treated natively is a string-based value.

While it may seem inconvenient to always use another type as an intermediary when interacting with JSON, that’s the current reality.

Just use TEXT and JSONB

If we reexamine the JSON type documentation, we also see the ->> operator. This not only pulls the indicated field, but automatically casts it to text. This means that we can turn this ugly monstrosity:

SELECT ('{ "field": "value" }'::JSON->'field')::TEXT;

Into this:

SELECT '{ "field": "value" }'::JSON->>'field';

From here, we can perform any action normally possible with a text-based value. This is the only way to pull a JSON or JSONB field directly into a PostgreSQL native type.
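
For example, to treat a JSON number as a real PostgreSQL NUMERIC, pull it out with ->> and cast the resulting text. The critters table in the second statement is purely a hypothetical illustration:

SELECT ('{ "weight": 389.4 }'::JSONB->>'weight')::NUMERIC * 2;

-- The same idea applied to a table with a JSONB column:
SELECT * FROM critters WHERE (attributes->>'weight')::NUMERIC > 100;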

All of this would suggest that the safest way to work with JSON or JSONB is through text. Ironically, text is also the only way to compare JSON and JSONB values with each other. Observe:

SELECT '"moo"'::JSON = '"moo"'::JSONB;

ERROR:  operator does not exist: json = jsonb

And yet:

SELECT '"moo"'::JSON::TEXT = '"moo"'::JSONB::TEXT;

 ?column? 
----------
 t

Well, then. What this means is pretty clear: convert at the last minute, and always use some kind of text value when dealing with JSON and JSONB.

While I’m here, I’d also like to point out a somewhat amusing side-effect of how JSONB works as opposed to JSON. Textual data gets converted to JSONB automatically when JSONB is one side of an equality comparison. What does that mean? All of these are valid; note that I’m quoting everything so it’s treated as text:

SELECT '"moo"'::JSONB = '"moo"';
SELECT '365'::JSONB = '365';
SELECT 'true'::JSONB = 'true';

But all of these produce an error:

SELECT '"moo"'::JSON = '"moo"';
SELECT '365'::JSON = '365';
SELECT 'true'::JSON = 'true';

This alone suggests that the lack of interoperability between JSON and JSONB is more of an oversight, and that JSON is missing some casting rules. Hopefully, that means 9.5 will carry some corrections in this regard. It’s hard to imagine PostgreSQL will leave JSON as a lame-duck datatype that was only really useful for 9.3 while JSONB was being developed.

If not, I guess that means I don’t have to revisit this topic in the future. Everyone knows I love being lazy.

Marco Slot: PostgreSQL and CitusDB on Raspberry Pi 2


CitusDB Raspberry Pi 2 cluster

One of the nice things about working at Citus Data is that we get time and budget to work on personal growth projects. A few of us are playing with the Raspberry Pi. I recently got two Raspberry Pi 2's and installed the Raspbian distribution on 16GB SD cards.

Of course, the first thing I tried was to install CitusDB (our solution for massively scaling PostgreSQL across a cluster of commodity servers) and pg_shard and make them crunch some numbers. For this set-up, one of the Pi's acted as both master and worker node and the other one as a regular worker node. The Raspberry Pi 2 is a modest device, so I used a modest 1GB dataset from the TPC-H and ran the first query, which gave the following result:

        CitusDB    PostgreSQL
Q1      11.9s      85.0s

What's astonishing about the Raspberry Pi 2 is how cost-effective it is for running CitusDB. The total investment in this set-up, including the microSD cards, was about $100. A cluster of 4-5 Pi's should be on par with a high performance desktop twice the cost.

Now, you might be wondering whether I brought down the whole cluster by taking a picture due to the Raspberry Pi 2's camera shyness. Yes, yes I did. Next time I'll place the Pi's in separate locations such that CitusDB will mask the node failure when taking pictures :).

We don't currently distribute a package for Raspbian, but contact us if you would like to try yourself.

Marco Slot: cstore_fdw 1.2 release notes


Citus Data is excited to announce the release of cstore_fdw 1.2 which is available on GitHub at github.com/citusdata/cstore_fdw. cstore_fdw is Citus Data's open source columnar store extension for PostgreSQL.

The changes in this release include:

  • INSERT INTO ... SELECT ... support. You can now use the "INSERT INTO cstore_table SELECT ..." syntax to load data into a cstore table directly from regular tables.
  • COPY TO support. You can now use the COPY command to copy the contents of a cstore table to a file.
  • Improved Memory Usage. Some of our users reported that cstore_fdw was using too much memory when the number of columns was high. This version of cstore_fdw uses memory more efficiently, resulting in up to 90% less memory usage.

For installation and update instructions, please see cstore_fdw’s page in GitHub.

To learn more about what’s coming up for cstore_fdw see our development roadmap.

Got questions?

If you have questions about cstore_fdw, please contact us using the cstore-users Google group.

If you discover an issue when using cstore_fdw, please submit it to cstore_fdw’s issue tracker on GitHub.

Further information about cstore_fdw is available on our website where you can also find information on ways to contact us with questions.

Pavel Stehule: long term monitoring communication between PostgreSQL client and PostgreSQL server

We have been having an issue with our application and pgbouncer: we detect some new errors with very low frequency. One way to find the reason for these issues is to monitor the communication between our application and Postgres. I found a great tool, pgShark, but I had to solve two issues first.
  1. I had to reduce the logged content - lots of messages are unimportant for my purpose or generate a lot of output. pgs-debug has no option for this, so I had to modify the source code: you can simply comment out the unwanted methods. I disabled: Bind, BindComplete, CopyData, DataRow, Describe, Parse, ParseComplete, RowDescription, Sync. After this change the compressed log was a few GB per day.
  2. I wanted the output (log) to have a timestamp attached to each line. I can do that simply in bash:
    | while read line; do echo `date +"%T.%3N"` $line; done | 
I wrote a line:
unbuffer ./pgs-debug --host 127.0.0.1 -i lo --port 6432 | while read line; do echo `date +"%T.%3N"` $line; done | gzip > /mnt/large/pgsharklog.gz

It does what I need:
12:55:13.407 P=1425556513.403313, s=288765600856048 type=SSLRequest, F -> B
12:55:13.408 SSL REQUEST
12:55:13.409
12:55:13.411 P=1425556513.403392, s=288765600856048 type=SSLAnswer, B -> F
12:55:13.412 SSL BACKEND ANSWER: N
12:55:13.414
12:55:13.415 P=1425556513.403486, s=288765600856048 type=StartupMessage, F -> B
12:55:13.416 STARTUP MESSAGE version: 3
12:55:13.418 database=db_lc3hfmn22q8vdt6mhopr2wj4zskyaous
12:55:13.419 application_name=starjoin
12:55:13.420 user=beard
12:55:13.421
12:55:13.423 P=1425556513.403526, s=288765600856048 type=AuthenticationMD5Password, B -> F
12:55:13.424 AUTHENTIFICATION REQUEST code=5 (MD5 salt='fe45f1a1')
12:55:13.425
12:55:13.426 P=1425556513.403577, s=288765600856048 type=PasswordMessage, F -> B
12:55:13.428 PASSWORD MESSAGE password=md5a0cd0711e0e191467bca6e94c03fb50f
12:55:13.429
12:55:13.430 P=1425556513.403614, s=288765600856048 type=AuthenticationOk, B -> F
12:55:13.431 AUTHENTIFICATION REQUEST code=0 (SUCCESS)
12:55:13.433
12:55:13.434 P=1425556513.403614, s=288765600856048 type=ParameterStatus, B -> F
12:55:13.435 PARAMETER STATUS name='integer_datetimes', value='on'
12:55:13.436
12:55:13.437 P=1425556513.403614, s=288765600856048 type=ParameterStatus, B -> F
12:55:13.439 PARAMETER STATUS name='IntervalStyle', value='postgres'
12:55:13.440

Tomas Vondra: Making debugging with GDB a bit easier


I spend more and more time debugging PostgreSQL internals - analyzing bugs in my patches, working on new patches etc. That requires looking at structures used by the internals - optimizer, planner, executor etc. And those structures are often quite complex, nested in various ways so exploring the structure with a simple print gets very tedious very quickly:

(gdb) print plan
$16 = (PlannedStmt *) 0x2ab7dc0
(gdb) print plan->planTree
$17 = (struct Plan *) 0x2ab6590
(gdb) print plan->planTree->lefttree
$18 = (struct Plan *) 0x2ab5cf0
(gdb) print plan->planTree->lefttree->lefttree
$19 = (struct Plan *) 0x2ab5528
(gdb) print plan->planTree->lefttree->lefttree->lefttree
$20 = (struct Plan *) 0x2ab1290
(gdb) print *plan->planTree->lefttree->lefttree->lefttree
$21 = {type = T_SeqScan, startup_cost = 0, total_cost = 35.5, plan_rows = 2550,
  plan_width = 4, targetlist = 0x2ab4e48, qual = 0x0, lefttree = 0x0,
  righttree = 0x0, initPlan = 0x0, extParam = 0x0, allParam = 0x0}
(gdb) print *(SeqScan *) plan->planTree->lefttree->lefttree->lefttree
$22 = {plan = {type = T_SeqScan, startup_cost = 0, total_cost = 35.5,
    plan_rows = 2550, plan_width = 4, targetlist = 0x2ab4e48, qual = 0x0,
    lefttree = 0x0, righttree = 0x0, initPlan = 0x0, extParam = 0x0,
    allParam = 0x0}, scanrelid = 1}

And then you move somewhere else and have to start from scratch :-(

Fortunately, gdb provides a Python API so that it's possible to extend it quite easily - define new commands, change the way values are printed etc.

I've hacked a small script that defines a new command pgprint that makes it easier to make my life easier. It's not perfect nor the most beautiful code in the world, but you can do this for example:

(gdb) pgprint plan
          type: CMD_SELECT
      query ID: 0
    param exec: 0
     returning: False
 modifying CTE: False
   can set tag: True
     transient: False
  row security: False

     plan tree: 
        -> HashJoin (cost=202.125...342.812 rows=2550 width=16)
                target list:
                        TargetEntry (resno=1 resname="id" origtbl=16405 origcol=1 ...
                        TargetEntry (resno=2 resname="id" origtbl=16410 origcol=1 ...
                        TargetEntry (resno=3 resname="id" origtbl=16420 origcol=1 ...
                        TargetEntry (resno=4 resname="id" origtbl=16430 origcol=1 ...

                -> HashJoin (cost=134.750...240.375 rows=2550 width=12)
                        target list:
                                TargetEntry (resno=1 resname=(NULL) origtbl=0 ori ...
                                TargetEntry (resno=2 resname=(NULL) origtbl=0 ori ...
                                TargetEntry (resno=3 resname=(NULL) origtbl=0 ori ...

                        -> HashJoin (cost=67.375...137.938 rows=2550 width=8)
                                target list:
                                        TargetEntry (resno=1 resname=(NULL) origt ...
                                        TargetEntry (resno=2 resname=(NULL) origt ...
                ...
   range table:
        RangeTblEntry (kind=RTE_RELATION relid=16405 relkind=r)
        RangeTblEntry (kind=RTE_RELATION relid=16410 relkind=r)
        RangeTblEntry (kind=RTE_RELATION relid=16420 relkind=r)
        RangeTblEntry (kind=RTE_RELATION relid=16430 relkind=r)
 relation OIDs: [16405, 16410, 16420, 16430]
   result rels: (NIL)
  utility stmt: (NULL)
      subplans: (NIL)

It's available at github so just download it somewhere on disk, load it into gdb like this:

(gdb) source /home/tomas/work/gdbpg/gdbpg.py

and start using pgprint command. Or define your own commands - it's really simple.

Is that the best thing since sliced bread? Hardly, but hopefully it will make your life a bit easier.

The gdb Python API is way more powerful, so there are certainly ways to improve this simple script (aside from adding support for more Node types). Feel free to send me a patch, a pull request or just an idea for improvement.

damien clochard: 70 Shades of Postgres


Support for the SQL/MED standard was introduced in PostgreSQL 9.1 (2011). Four years later, we now have more than 70 Foreign Data Wrappers (FDW) available, which let you use PostgreSQL to read and write almost any type of data storage: Oracle, MongoDB, XML files, Hadoop, Git repos, Twitter, you name it, …

A few days ago, during FOSDEM, I attended a talk by my colleague Ronan Dunklau about Foreign Data Wrappers in PostgreSQL: Where are we now?. Ronan is the author of the Multicorn extension, and during his talk I couldn’t help thinking that Multicorn is probably one of the most underrated pieces of code in the PostgreSQL community. As a quick exercise I started to count all the PostgreSQL FDWs I knew about… and soon realized there were too many to fit in my small memory.

So I went back to the FDW page on the PostgreSQL wiki, started to update and clean up the catalog of all the existing FDWs, and ended up with a list of 70 wrappers for PostgreSQL…

Lessons learned

During this process I learned a few things :

  • Almost all the major RDBMS are covered, except DB2. Strangely, DB2 is also the only other RDBMS with a SQL/MED implementation. I’m not sure if those two facts are related or not.

  • One third of the FDWs are written in Python and based on Multicorn. The others are “native” C wrappers. Obviously C wrappers are preferred for the database wrappers (ODBC, Oracle, etc.) for performance reasons. Meanwhile, Multicorn is mostly used for querying web services (S3 storage, RSS files, etc.) and specific data formats (genotype files, GeoJSON, etc.)

  • I’m not sure they were meant to be used this way, but Foreign Data Wrappers are also used to innovate. Some wrappers are diverted from their original purpose to implement new features. For instance, CitusDB has released a column-oriented storage, and other wrappers are going in more exotic directions such as an OS-level stat collector, GPU acceleration for seq scans, or a distributed parallel query engine, … Most of these projects are not suited for production, of course. However, it shows how easily you can implement a proof of concept and how versatile this technology is…

  • The PGXN distribution network doesn’t seem to be well known among FDW developers. Only 20% of the wrappers are available via PGXN. Or are developers too lazy to package their extensions?

  • The success of the PostgreSQL SQL/MED implementation comes with a drawback. My guess is that there will be more than 100 data wrappers available by the end of 2015. For some data sources like MongoDB or LDAP, there are already several different wrappers, not to mention the procession of Hadoop connectors :) In the long term, PostgreSQL end users might get confused by all these wrappers, and it will be hard to know which one is maintained and which one is obsolete… The wiki page tries to answer that problem, but there may be other solutions to provide information to the end user… Maybe we need a Foreign Data Wrapper that would list all the Foreign Data Wrappers :)

The next big thing : IMPORT FOREIGN SCHEMA

Moreover, there’s still a big limitation in the current SQL/MED implementation: metadata. Right now, there’s no simple way to use the schema introspection capabilities of the foreign data source. This means you have to create all the foreign tables one by one. When you want to map all the tables of a remote database, it can be time-consuming and error-prone. If you’re connecting to a remote PostgreSQL instance you can use tricks like this one, but objectively this is a job the wrapper should do by itself (when possible).

Here comes the new IMPORT FOREIGN SCHEMA statement! This new feature was written by Ronan Dunklau, Michael Paquier and Tom Lane. You can read a quick demo on Michael’s blog. This will be available in the forthcoming PostgreSQL 9.5.
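
For the curious, the statement looks roughly like this (server, schema and table names are made up for the example):

IMPORT FOREIGN SCHEMA public
    LIMIT TO (customers, orders)
    FROM SERVER remote_pg
    INTO local_mirror;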

This is a huge improvement! The IMPORT feature combined with the dozens of wrappers available are two key factors: PostgreSQL is becoming a solid data integration platform, and it reduces the need for external ETL software.

PostgreSQL as a Data Integration Platform

Links :


Leo Hsu and Regina Obe: DELETE all data really fast with TRUNCATE TABLE CASCADE


Though it is a rare occurrence, we have had occasions where we need to purge ALL data from a table. Our preferred approach is TRUNCATE TABLE because it's orders of magnitude faster than the DELETE FROM construct. You can't, however, use TRUNCATE TABLE unqualified if the table you are truncating has foreign key references from other tables. In comes its extended form, the TRUNCATE TABLE .. CASCADE construct, introduced in PostgreSQL 8.2, which will not only delete all data from the main table, but will also CASCADE to all the referencing tables.
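
A minimal sketch of the behavior, with invented table names:

CREATE TABLE customers (id int PRIMARY KEY);
CREATE TABLE orders (
    id int PRIMARY KEY,
    customer_id int REFERENCES customers(id)
);

-- Fails: orders has a foreign key referencing customers
TRUNCATE TABLE customers;

-- Works: truncates customers and every table referencing it (here, orders)
TRUNCATE TABLE customers CASCADE;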


Continue reading "DELETE all data really fast with TRUNCATE TABLE CASCADE"

Dimitri Fontaine: a pgDay in Paris!


I was lucky to participate as a speaker in the Nordic PostgreSQL Day 2015, and it's been another awesome edition of the conference. Really smooth, everything ran as it should, with about one hundred people at the conference.

In action at Nordic pgDay in Copenhagen

You can get the slides I've been using for my talk at the Nordic pgDay 2015 page on the Conferences pages of this website.

The Nordic pgDay is such a successful conference that I long wanted to have just the same in my area, and so we made pgDay Paris and modeled it against Nordic. It's planned to be all the same, just with a different audience given the location.

April 21st, save the date and join us in Paris

The pgDay Paris welcomes English-speaking speakers and the Call for Papers is now open, so please consider submitting your talk proposal for pgDay Paris by clicking on the link and filling in the form. Not too hard a requirement for being allowed to visit such a nice city as Paris, really!

Josh Berkus: Benchmarking Postgres in the Cloud, part 1

In 2008, when Heroku started, there was only one real option for cloud hosting PostgreSQL: roll-your-own on EC2, or a couple other not-very-competitive platforms.  Since then, we've seen the number of cloud hosting providers explode, and added several "PostgreSQL-As-A-Service" providers as well: first Heroku, then Gandi, CloudFoundry, RDS, OpenShift and more.  This has led many of pgExperts' clients to ask: "Where should I be hosting my PostgreSQL?"

So to provide a definitive answer to that question, for the past several weeks I've been doing some head-to-head testing of different cloud hosting options for PostgreSQL.  Even more work has been done by my collaborator, Ruben Rudio Rey of ManageACloud.com.  I will be presenting on the results of this testing in a series of blog posts, together with a series of presentations starting at SCALE and going through pgConf NYC, LinuxFestNorthWest, and culminating at pgCon.   Each presentation will add new tests and new data.

Here's my slides from SCALE, which compare AWS, RDS, and Heroku, if you want to get some immediate data.

What We're Testing

The idea is to run benchmarks against ephemeral instances of PostgreSQL 9.3 on each cloud or service.  Our main goal is to collect performance figures, since while features and pricing are publicly available, performance information is not.  And even when the specification is the same, the actual throughput is not.  From each cloud or service, we are testing two different instance sizes:

Small: 1-2 cores, 3 to 4GB RAM, low throughput storage (compare EC2's m3.medium).  This is the "economy" instance for running PostgreSQL; it's intended to represent what people with non-critical PostgreSQL instances buy, and to answer the question of "how much performance can I get for cheap".

Large: 8-16 cores, 48 to 70GB RAM, high throughput storage (compare EC2's r3.2xlarge).  This is the maximum for a "high end" instance which we could afford to test in our test runs. 

The clouds we're testing or plan to test include:
  • AWS EC2 "roll-your-own".
  • Amazon RDS PostgreSQL
  • Heroku
  • Google Compute Engine
  • DigitalOcean
  • Rackspace Cloud
  • OpenShift PostgreSQL Cartridge
  • (maybe Joyent, not sure)
Note that in many cases we're working with the cloud vendor to achieve maximum performance results.  Our goal here isn't to "blind test" the various clouds, but rather to try to realistically deliver the best performance we can get on that platform.  In at least one case, our findings have resulted in the vendor making improvements to their cloud platform, which then allowed us to retest with better results.

The tests we're running include three pgbench runs:

  • In-Memory, Read-Write (IMRW): pgbench database 30% to 60% of the size of RAM, full transaction workload
  • In-Memory, Read-Only (IMRO): pgbench database 30% to 60% of RAM, read-only queries
  • On-Disk, Read-Write (ODRW): pgbench database 150% to 250% of RAM, full transactions
The idea here is to see the different behavior profiles with WAL-bound, CPU-bound, and storage-bound workloads.  We're also recording the load time for each database, since bulk loading behavior is useful information for each platform. 

Each combination of cloud/size/test needs to then be run at least 5 times in order to get a statistically useful sample.  As I will document later, often the difference between runs on the same cloud was greater than the difference between clouds.

Issues with pgBench as a Test Tool

One of the early things I discovered was some of the limitations of what pgbench could tell us.  Its workload is 100% random access and homogeneous one-liner queries.  It's also used extensively and automatically to test PostgreSQL performance.  As a result, we found that postgresql.conf tuning made little or no difference at all, so our original plan to test "tuned" vs. "untuned" instances went by the wayside.

We also found on public clouds that, because of the rapid-fire nature of pgbench queries, performance was dominated by network response times more than anything else on most workloads.  We did not use pgbench_tools, because that is concerned with automating many test runs against one host rather than a few test runs against many hosts.

For this reason, we also want to run a different, more "serious" benchmark which works out other performance areas.  To support this, I'm working on deploying Jignesh's build of DVDStore so that I can do that benchmark against the various platforms.  This will require some significant work to make a reality, though; I will need to create images or deployment tools on all of the platforms I want to test before I can do it.

To be continued ...

Francesco Canovai: Automating Barman with Puppet: it2ndq/barman (part one)


This is not the first time that 2ndQuadrant has looked at Puppet. Gabriele Bartolini has already written an article in two parts on how to rapidly configure a PostgreSQL server through Puppet and Vagrant, accompanied by the release of the code used in the example on GitHub (https://github.com/2ndquadrant-it/vagrant-puppet-postgresql).

This article, split into three parts, aims to demonstrate how to automate the setup and configuration of Barman to back up a PostgreSQL test server.

This article is an update of what was written by Gabriele with the idea of creating two virtual machines instead of one, a PostgreSQL server and a Barman server.

it2ndq/barman is the module released by 2ndQuadrant Italy to manage the installation of Barman through Puppet. The module has a GPLv3 licence and is available on GitHub at the address https://github.com/2ndquadrant-it/puppet-barman. The following procedure was written for an Ubuntu 14.04 Trusty Tahr but can be performed in a similar manner on other distributions.

Requirements

To start the module for Barman on a virtual machine, we need the following software:

Vagrant

Vagrant is a virtual machine manager, capable of supporting many virtualisation platforms, with VirtualBox as its default.

We install VirtualBox this way:

$ sudo apt-get install virtualbox virtualbox-dkms

The latest version of Vagrant can be downloaded from the site and installed with the command:

$ sudo dpkg -i /path/to/vagrant_1.7.2_x86_64.deb

Ruby

Regarding Ruby, our advice is to use rbenv, which creates a Ruby development environment in which to specify the version for the current user, thereby avoiding contamination of the system environment. To install rbenv we suggest using rbenv-installer (https://github.com/fesplugas/rbenv-installer).

Let’s download and execute the script:

$ curl https://raw.githubusercontent.com/fesplugas/rbenv-installer/master/bin/rbenv-installer | bash

At the end, the script will prompt you to append the following lines to the ~/.bash_profile file:

export RBENV_ROOT="${HOME}/.rbenv"
if [ -d "${RBENV_ROOT}" ]; then
  export PATH="${RBENV_ROOT}/bin:${PATH}"
  eval "$(rbenv init -)"
fi

We now need to reload the just changed ~/.bash_profile:

$ exec bash -l

At this point, we locally install a Ruby version (in this case, 2.1.5) and set the user to run this version rather than the system version:

$ rbenv install 2.1.5
$ rbenv global 2.1.5

Puppet

Puppet is required not only on the VMs but also on the machine running them. Therefore we need to install the Puppet gem.

$ gem install puppet

Librarian-puppet

Finally, librarian-puppet is a tool to automate the management of Puppet modules. Like Puppet, librarian-puppet can be installed as a gem:

$ gem install librarian-puppet

Vagrant: configuration

Now that we have the dependencies in place, we can start to write the Vagrant and Puppet configurations for our backup system.

We start by creating a working directory:

$ mkdir ~/vagrant_puppet_barman
$ cd ~/vagrant_puppet_barman

Vagrant needs us to write a file called Vagrantfile where it looks for the configuration of the VMs.

The following Vagrantfile starts two Ubuntu Trusty VMs, called pg and backup, with ip addresses 192.168.56.221 and 192.168.56.222. On both machines provisioning will be performed through an inline shell script.

This script launches puppet-bootstrap (https://github.com/hashicorp/puppet-bootstrap), a script that automatically installs and configures Puppet on various types of machines. As it does not need to be run more than once, a test was inserted in the script to prevent further executions.

Vagrant.configure("2") do |config|
  {
    :pg => {
      :ip  => '192.168.56.221',
      :box => 'ubuntu/trusty64'
    },
    :backup => {
      :ip  => '192.168.56.222',
      :box => 'ubuntu/trusty64'
    }
  }.each do |name, cfg|
    config.vm.define name do |local|
      local.vm.box = cfg[:box]
      local.vm.hostname = name.to_s + '.local.lan'
      local.vm.network :private_network, ip: cfg[:ip]
      family = 'ubuntu'
      bootstrap_url = 'https://raw.github.com/hashicorp/puppet-bootstrap/master/' + family + '.sh'
      # Run puppet-bootstrap only once
      local.vm.provision :shell, :inline => <<-eos
        if [ ! -e /tmp/.bash.provision.done ]; then
          curl -L #{bootstrap_url} | bash
          touch /tmp/.bash.provision.done
        fi
      eos
    end
  end
end

Bringing up the VMs

We have defined two Ubuntu Trusty VMs containing Puppet. This is not the final Vagrantfile, but it already allows the creation of the two machines. If you’re curious, it is possible to verify that the two machines have been created with the command:

$ vagrant up

and then connecting using the following commands:

$ vagrant ssh pg
$ vagrant ssh backup

Finally, the machines can be destroyed with:

$ vagrant destroy -f

Conclusions

In this first part of the tutorial we’ve seen how to configure the dependencies and ended up with the two virtual machines on which we’ll install, via Puppet, PostgreSQL and Barman. Writing the Puppet manifest for the actual installation will be the subject of the next article.

Bye for now!

Amit Kapila: Different Approaches for MVCC used in well known Databases

Database management systems use MVCC to avoid the problem of
writers blocking readers and vice-versa, by making use of multiple
versions of data.

There are essentially two approaches to multi-version concurrency.

Approaches for MVCC
The first approach is to store multiple versions of records in the
database, and garbage collect records when they are no longer
required. This is the approach adopted by PostgreSQL and
Firebird/Interbase. SQL Server also uses a somewhat similar approach,
with the difference that old versions are stored in tempdb
(a database separate from the main database).

The second approach is to keep only the latest version of data in
the database, but reconstruct older versions of data dynamically
as required by using undo. This is the approach adopted by Oracle
and MySQL/InnoDB.


MVCC in PostgreSQL
In PostgreSQL, when a row is updated, a new version (called a tuple)
of the row is created and inserted into the table. The previous version
is provided a pointer to the new version. The previous version is
marked “expired", but remains in the database until it is garbage collected.

In order to support multi-versioning, each tuple has additional data
recorded with it:
xmin - The ID of the transaction that inserted/updated the
row and created this tuple.
xmax - The transaction that deleted the row, or created a
new version of this tuple. Initially this field is null.

Transaction status is maintained in CLOG which resides in $Data/pg_clog.
This table contains two bits of status information for each transaction;
the possible states are in-progress, committed, or aborted.

PostgreSQL does not undo changes to database rows when a transaction
aborts - it simply marks the transaction as aborted in CLOG . A PostgreSQL
table therefore may contain data from aborted transactions.

A Vacuum cleaner process is provided to garbage collect expired/aborted
versions of a row. The Vacuum Cleaner also deletes index entries
associated with tuples that are garbage collected.

A tuple is visible if its xmin is valid and xmax is not.
“Valid" means “either committed or the current transaction".
To avoid consulting the CLOG table repeatedly, PostgreSQL maintains
status flags in the tuple that indicate whether the tuple is “known committed"
or “known aborted".
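
You can peek at this hidden per-tuple information yourself; a minimal sketch
(table name is arbitrary):

CREATE TABLE t (val int);
INSERT INTO t VALUES (1);

-- xmin is the inserting transaction's ID; xmax is normally 0 while no
-- transaction has deleted or superseded this row version.
SELECT xmin, xmax, val FROM t;

UPDATE t SET val = 2;

-- The visible row is now the new version with a fresh xmin; the old
-- version (with its xmax set) still exists on disk until VACUUM removes it.
SELECT xmin, xmax, val FROM t;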


MVCC in Oracle
Oracle maintains old versions in rollback segments (also known as
the 'undo log').  A transaction ID is not a sequential number; instead, it is
made up of a set of numbers that points to the transaction entry (slot) in a
rollback segment header.

Rollback segments have the property that new transactions can reuse
storage and transaction slots used by older transactions that are
committed or aborted.
This automatic reuse facility enables Oracle to manage large numbers
of transactions using a finite set of rollback segments.

The header block of the rollback segment is used as a transaction table.
Here the status of a transaction is maintained (called System Change Number,
or SCN, in Oracle).  Rather than storing a transaction ID with each row
in the page, Oracle saves space by maintaining an array of unique transaction
IDs separately within the page, and stores only the offset into this array with
the row.

Along with each transaction ID, Oracle stores a pointer to the last undo record
created by the transaction for the page.  Not only are table rows stored in this
way, Oracle employs the same techniques when storing index rows. This is
one of the major differences between PostgreSQL and Oracle.

When an Oracle transaction starts, it makes a note of the current SCN. When
reading a table or an index page, Oracle uses the SCN number to determine if
the page contains the effects of transactions that should not be visible to the
current transaction.  Oracle checks the commit status of a transaction by
looking up the associated Rollback segment header, but, to save time, the first
time a transaction is looked up, its status is recorded in the page itself to avoid
future lookups.

If the page is found to contain the effects of invisible transactions, then Oracle
recreates an older version of the page by undoing the effects of each such
transaction. It scans the undo records associated with each transaction and
applies them to the page until the effects of those transactions are removed.
The new page created this way is then used to access the tuples within it.

Record Header in Oracle
A row header never grows; it is always a fixed size. For non-cluster tables,
the row header is 3 bytes.  One byte is used to store flags, one byte to
indicate if the row is locked (for example because it's updated but not
committed), and one byte for the column count.


MVCC in SQL Server
Snapshot isolation and read committed using row versioning are enabled
at the database level.  Only databases that require this option need to enable
it, and only they incur the overhead associated with it.

Versioning effectively starts with a copy-on-write mechanism that is
invoked when a row is modified or deleted. Row versioning–based
transactions can effectively "view" the consistent version of the data
from these previous row versions.

Row versions are stored within the version store that is housed within the
tempdb database.  More specifically, when a record in a table or index is
modified, the new record is stamped with the "sequence_number" of the
transaction that is performing the modification.
The old version of the record is copied to the version store, and the new record
contains a pointer to the old record in the version store.
If multiple long-running transactions exist and multiple "versions" are required,
records in the version store might contain pointers to even earlier versions of
the row.

Version store cleanup in SQL Server
SQL Server manages the version store size automatically, and maintains a
cleanup thread to make sure it does not keep versioned rows around longer
than needed.  For queries running under Snapshot Isolation, the version
store retains the row versions until the transaction that modified the data
completes and the transactions containing any statements that reference the
modified data complete.  For SELECT statements running under
Read Committed Snapshot Isolation, a particular row version is no longer
required, and is removed, once the SELECT statement has executed.

If tempdb actually runs out of free space, SQL Server calls the cleanup
function and will increase the size of the files, assuming we configured the
files for auto-grow.  If the disk gets so full that the files cannot grow,
SQL Server will stop generating versions. If that happens, any snapshot
query that needs to read a version that was not generated due to space
constraints will fail.

Record Header in SQL Server
The record header is 4 bytes long:
 - two bytes of record metadata (record type)
 - two bytes pointing forward in the record to the NULL bitmap. This is an
   offset to the actual data in the record (the fixed-length columns).

Versioning tag - this is a 14-byte structure that contains a timestamp
plus a pointer into the version store in tempdb.
Here the timestamp is the transaction_seq_number; the only time that rows get
versioning info added to the record is when it’s needed to support a
versioning operation.

As the versioning information is optional, I think that is the reason
they could store this info in index records as well without much
impact.

A quick comparison of the three databases:

  • Storage for old versions - PostgreSQL: in the main segment (heap/index); Oracle: in a separate segment (rollback segment/undo); SQL Server: in a separate database (tempdb, known as the version store).
  • Size of tuple header (bytes) - PostgreSQL: 24; Oracle: 3; SQL Server: fixed 4, variable 14.
  • Clean up - PostgreSQL: Vacuum; Oracle: System Monitor Process (SMON); SQL Server: Ghost Cleanup task.

Conclusion of study
As other databases store version/visibility information in the index, index
cleanup is easier for them (it is no longer tied to the heap for visibility information).
The advantage of not storing the visibility information in the index is that for
Delete operations we don't need to perform an index delete, and the
index record can probably be somewhat smaller.

Oracle and probably MySQL (InnoDB) need to write a record in the undo
segment for an Insert statement, whereas in PostgreSQL/SQL Server, the new
record version is created only when a row is modified or deleted.

Only the changed values are written to undo, whereas PostgreSQL/SQL Server
create a complete new tuple for the modified row.  The undo approach avoids
bloat in the main heap segment.

Both Oracle and SQL Server have some way to restrict the growth of version
information, whereas PostgreSQL/PPAS doesn't have any.

Joshua Drake: Stomping to PgConf.US: Webscale is Dead; PostgreSQL is King! A challenge, do you accept?

I submitted to PgConf.US. I submitted talks from my general pool. All of them have been recently updated. They are also all solid talks that have been well received in the past. I thought I would end up giving my, "Practical PostgreSQL Performance: AWS Edition" talk. It is a good talk, is relevant to today and the community knows of my elevated opinion of using AWS with PostgreSQL (there are many times it works just great, until it doesn't and then you may be stuck).

I also submitted a talk entitled: "Suck it! Webscale is Dead; PostgreSQL is King!". This talk was submitted as a joke. I never expected it to be accepted, it hadn't been written, the abstract was submitted on the fly, improvised and in one take. Guess which talk was accepted? "Webscale is Dead; PostgreSQL is King!". They changed the first sentence of the title which is absolutely acceptable. The conference organizers know their audience best and what should be presented.

What I have since learned is that the talk submission committee was looking for dynamic talks, dynamic content, and new, inspired ideas. A lot of talks that would have been accepted in years past weren't and my attempt at humor fits the desired outcome. At first I thought they were nuts but then I primed the talk at SDPUG/PgUS PgDay @ Southern California Linux Expo.

I was the second to last presenter on Thursday. I was one hour off the plane. I was only staying the night and flying home the next morning, early. The talk was easily the best received talk I have given. The talk went long, the audience was engaged, laughter, knowledge and opinions were abound. When the talk was over, the talk was given enthusiastic applause and with a definite need for water, I left the room.

I was followed by at least 20 people, if not more. I don't know how many there were but it was more than I have ever had follow me after a talk before. I was deeply honored by the reception. One set of guys that approached me said something to the effect of: "You seem like you don't mind expressing your opinions". At this point, some of you reading may need to get a paper towel for your coffee because those that know me, know I will readily express an opinion. I don't care about activist morality or political correctness. If you don't agree with me, cool. Just don't expect me to agree with you. My soapbox is my own, rent is 2500.00 a minute, get in line. I digress, what did those guys ask me about? Systemd, I don't think they were expecting my answer, because I don't really have a problem with Systemd.

Where am I going with this post? I am stomping my way to PgConf.US with an updated version of this talk (You always learn a few things after giving a performance). I am speaking in the first slot on Friday and I am going to do everything I can to bring it. I can't promise to be the best, I can promise to do everything in my power to be my best. I am being recorded this time. My performance will be on the inner tubes forever. I have no choice.

A challenge, do you accept?

I challenge all speakers at this voyage of PgConf.US to take it up a notch. If you were accepted, you have a responsibility to do so. Now, now, don't get me wrong. I am not suggesting that you put on a chicken suit and Fox News t-shirt to present. I am however suggesting that if you are a monotone speaker, try not to be. If you are boring, your audience will be bored and that is the last thing the conference, you or the audience wants. So speak from your diaphragm, engage the audience and make their time worth it!

Tomas Vondra: Performance since PostgreSQL 7.4 / fulltext


After discussing the pgbench and TPC-DS results, it's time to look at the last benchmark, testing performance of built-in fulltext (and GIN/GiST index implementation in general).

The one chart you should remember from this post is this one, GIN speedup between 9.3 and 9.4:

fulltext-timing-speedups.png

Interpreting this chart is a bit tricky - x-axis tracks duration on PostgreSQL 9.3 (log scale), while y-axis (linear scale) tracks relative speedup 9.4 vs. 9.3, so 1.0 means 'equal performance', and 0.5 means that 9.4 is 2x faster than 9.3.

The chart pretty much shows exponential speedup for vast majority of queries - the longer the duration on 9.3, the higher the speedup on 9.4. That's pretty awesome, IMNSHO. What exactly caused that will be discussed later (spoiler: it's thanks to GIN fastscan). Also notice that almost no queries are slower on 9.4, and those few examples are not significantly slower.

Benchmark

While both pgbench and TPC-DS are well-established benchmarks, there's no such benchmark for testing fulltext performance (as far as I know). Luckily, I had played with the fulltext features a while ago, implementing archie - an in-database mailing list archive.

It's still quite experimental and I use it for testing GIN/GiST related patches, but it's suitable for this benchmark too.

So I've taken the current archives of PostgreSQL mailing lists, containing about 1 million messages, loaded them into the database and then executed 33k real-world queries collected from postgresql.org. I can't publish those queries because of privacy concerns (there's no info on users, but still ...), but the queries look like this:

SELECT id
  FROM messages
 WHERE body_tsvector @@ ('optimizing & bulk & update')::tsquery
 ORDER BY ts_rank(body_tsvector, ('optimizing & bulk & update')::tsquery) DESC
 LIMIT 100;

The number of search terms varies quite a bit - the simplest queries have a single letter, the most complex ones often tens of words.
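
For reference, the indexes being compared are ordinary fulltext indexes on that column, roughly like this (simplified; the index names are made up and the real archie schema has more to it):

-- GIN variant (the index type most of this post is about)
CREATE INDEX messages_body_idx ON messages USING gin (body_tsvector);

-- GiST variant used for the comparison runs
-- CREATE INDEX messages_body_idx ON messages USING gist (body_tsvector);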

PostgreSQL config

The PostgreSQL configuration was mostly default, with only minor changes:

shared_buffers = 512MB
work_mem = 64MB
maintenance_work_mem = 128MB
checkpoint_segments = 32
effective_cache_size = 4GB

Loading the data

We have to load the data first, of course. In this case that involves a fair amount of additional logic implemented either in Python (parsing the mbox files into messages, loading them into the database), or PL/pgSQL triggers (thread detection, ...). The time needed to load all the 1M messages, producing ~6GB database, looks like this:

fulltext-load.png

Note: The chart only shows releases where the performance changed, so if only data for 8.2 and 9.4 are shown, it means that the releases up until 9.3 behave like 8.2 (more or less).

The common wisdom is that querying GIN indexes is faster than GiST, but that they are more expensive when it comes to maintenance (creation, etc).

If you look at PostgreSQL 8.2, the oldest release supporting GIN indexes, that certainly was true - the load took ~1300 seconds with GIN indexes and only ~800 seconds with GiST indexes. But 8.4 significantly improved this, making GIN indexes only slightly more expensive than GiST.

Of course, this is incremental load - it might look very different if the indexes were created after all the data are loaded, for example. But I argue that the incremental performance is more important here, because that's what usually matters in actual applications.

The other argument might be that the overhead of the Python parser and PL/pgSQL triggers is overshadowing the GIN / GiST difference. That may be true, but that overhead should be about the same for both index types, so real-world applications are likely to have similar overhead.

So I believe that GIN maintenance is not significantly more expensive than GiST - at least in this particular benchmark, but probably in other applications too. I have no doubt it's possible to construct examples where GIN maintenance is much more expensive than GiST maintenance.

Query performance

The one thing that's missing in the previous section is query performance. Let's assume your workload is 90% reads, and GIN is 10x faster than GiST for the queries you do - how much you care if GIN maintenance is 10x more expensive than GiST, in that case? In most cases, you'll choose GIN indexes because that'll probably give you better performance overall. (It's more complicated, of course, but I'll ignore that here.)

So, how did the GIN and GiST performance evolve over time? GiST indexes were introduced first - in PostgreSQL 8.0 as a contrib module (aka an extension in newer releases), and then in core in PostgreSQL 8.3. Using the 33k queries, the time to run all of them on each release is this (i.e. lower values are better):

fulltext-gist.png

Interesting. It took only ~3200 seconds on PostgreSQL 8.0 - 8.2, and then it slowed down to ~5200 seconds. That may be seen as a regression, but my understanding is that this is the cost of the move into core - the contrib module was probably limited in various ways, and proper integration with the rest of the core required fixing those shortcomings.

What about GIN? This feature was introduced in PostgreSQL 8.2, directly as an in-core feature (so not as a contrib module first).

fulltext-gin.png

Interestingly it was gradually slowing down a bit (by about ~15% between 8.2 and 9.3) - I take it as a sign that we really need regular benchmarking as part of development. Then, on 9.4 the performance significantly improved, thanks to this change:

  • Improve speed of multi-key GIN lookups (Alexander Korotkov, Heikki Linnakangas)

also known as "GIN fastscan".

I was discussing GIN vs. GiST maintenance cost vs. query performance a few paragraphs back, so what is the performance difference between GIN and GiST?

fulltext-gist-vs-gin.png

Well, in this particular benchmark, GIN indexes are about 10x faster than GiST (it would be sad otherwise, because fulltext is the primary use of GIN), and as we've seen before, GIN was not much slower than GiST maintenance-wise.

GIN fastscan

So what is the GIN fastscan about? I'll try to explain this, although it's of course a significantly simplified explanation.

GIN indexes are used for indexing non-scalar data - for example when it comes to fulltext, each document (stored in a TEXT column as a single value) is transformed into tsvector, a list of words in the document (along with some other data, but that's irrelevant here). For example let's assume document with ID=10 contains the popular sentence

10=> "The quick brown fox jumps over the lazy dog"

This will get split into an array of words (this transformation may even remove some words, perform lemmatization):

10=> ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

If you build a GIN index on this "vector" representation, the index will effectively invert the direction of the mapping by mapping words to the IDs of all the rows containing that word (each row ID is a pair of block number and offset on that block):

"The"   => [(0,1), (0,10), (0,15), (2,4), ...]
"quick" => [(0,1), (0,2), (2,10), (2,15), (3,18), ...]
"brown" => [(1,10), (1,11), (1,12), ...]
...

Then, if you do a fulltext query on the document, say

SELECT * FROM documents WHERE to_tsvector(body) @@ to_tsquery('quick & fox');

it can simply fetch the lists for quick and fox and combine them, to get only IDs of the documents containing both words.

And this is exactly where GIN fastscan was applied. Until PostgreSQL 9.4, the performance of this combination step was determined by the longest list of document IDs, because it had to be walked. So if you had a query combining rare and common words (the latter included in many documents, thus having long lists of IDs), it was often slow.

GIN fastscan changes this, starting with the short posting lists, and combining the lists in a smart way (by using the fact that the lists of IDs are sorted), so that the duration is determined by the shortest list of IDs.

How much impact can this have? Let's see!

Compression

The fastscan is not the only improvement in 9.4 - the other significant improvement is compression of the posting lists (lists of row IDs). If you look at the previous example, you might notice that the posting list can be made quite compressible - you may sort the row IDs (first by block number, then by row offset). The block numbers will then repeat a lot, and the row offsets will be an increasing sequence.

This redundancy may be exploited by various encoding schemes - RLE, delta, ... and that's what was done in PostgreSQL 9.4. The result is that GIN indexes are often much smaller. How much smaller really depends on the dataset, but for the dataset used in this benchmark the size dropped to 50% - from ~630MB to ~330MB. Other developers reported up to 80% savings in some cases.
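
Checking the effect on your own data is straightforward (the index name here is the hypothetical one from the earlier sketch):

SELECT pg_size_pretty(pg_relation_size('messages_body_idx'));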

Relative speedup

The following chart (already presented at the beginning of this blog post) presents the speedup of a random sample from the 33k queries (plotting all the queries would only make it less readable). It shows the relative speedup depending on the duration on PostgreSQL 9.3, i.e. each point plots

  • x-axis (log-scale) - duration on PostgreSQL 9.3
  • y-axis (linear) - (duration on PostgreSQL 9.4) / (duration on PostgreSQL 9.3)

So if the query took 100 ms on PostgreSQL 9.3, and only takes 10 ms on PostgreSQL 9.4, this is represented by a point [100, 0.1].

fulltext-timing-speedups.png

There are a few interesting observations:

  • Only very few queries slowed down on PostgreSQL 9.4. Those queries are either very fast, taking less than 1ms, with a slow-down of less than 1.6x (this may easily be noise), or longer but with a slowdown well below 10% (again, possibly noise).
  • The vast majority of queries are significantly faster than on PostgreSQL 9.3, which is clearly visible as an area with a high density of blue dots. The most interesting thing is that the higher the PostgreSQL 9.3 duration, the higher the speedup.

This is perfectly consistent with the GIN fastscan - the queries that combine frequent and rare words took time proportional to the frequent word on PostgreSQL 9.3, but thanks to fastscan the performance is determined by the rare words. Hence the exponential speedup.

Fulltext dictionaries

While I'm quite excited about the speedup, the actual performance depends on other things too - for example what dictionary you use. In this benchmark I've been using the english dictionary, based on a simple snowball stemmer - a simple algorithmic stemmer, not using any kind of dictionary.

If you're using a more complicated configuration - for example a dictionary-based stemmer, because that's necessary for your language, this may take quite a significant amount of time (especially if you're not using connection pooling and so the dictionaries need to be parsed over and over again - my shared_ispell project might be interesting in this case).

GIN indexes as bitmap indexes

PostgreSQL does not have traditional bitmap indexes, i.e. indexes serialized into simple on-disk bitmaps. There were attempts to implement that feature in the past, but the gains never really outweighed the performance issues (locking and such), especially since 8.2 when bitmap index scans were implemented (i.e. construction of bitmaps from btree indexes at runtime).

But if you think about it, GIN indexes really are bitmap indexes, just with a different bitmap serialization format. If you're craving bitmap indexes (not uncommon in analytical workloads), you might try the btree_gin extension, which makes it possible to create GIN indexes on scalar types (by default GIN can be built only on vector-like types - tsvector and such).
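
A quick sketch of what that looks like (table and column are invented for the example):

CREATE EXTENSION btree_gin;

-- A GIN index on an ordinary scalar column, usable much like a bitmap index
CREATE INDEX orders_status_idx ON orders USING gin (status);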

Summary

  • The wisdom "GIN indexes are faster to query but more expensive to maintain" may not be true anymore, especially if the query performance is more important for you.
  • Load performance improved a lot, especially in PostgreSQL 8.2 (GiST) and 8.4 (GIN).
  • Query performance for GiST is mostly the same (at least since PostgreSQL 8.3 when GiST was included into core).
  • For GIN, the query performance was mostly the same until PostgreSQL 9.4, when the "fastscan" significantly improved performance of queries combining rare and frequent keys.

Julien Rouhaud: Talking About OPM and PoWA at pgconf.ru


Last month, I had the chance to talk about PostgreSQL monitoring, and present some of the tools I’m working on at pgconf.ru.

This talk was a good opportunity to work on an overview of existing projects dealing with monitoring or performance, see what may be lacking and what can be done to change this situation.

Here are my slides:

If you’re interested in this topic, or if you developed a tool I missed while writing these slides (my apologies if that’s the case), the official wiki page is the place you should go first.

I’d also like to thank all the pgconf.ru staff for their work, this conference was a big success, and the biggest postgresql-centric event ever organized.

Talking About OPM and PoWA at pgconf.ru was originally published by Julien Rouhaud at rjuju's home on March 18, 2015.

Joshua Drake: WhatcomPUG meeting last night on: sqitch and... bitcoin friends were made!

Last night I attended the second WhatcomPUG. This meeting was about Sqitch, an interesting database revision control mechanism. The system is written in Perl and was developed by David Wheeler of PgTap fame. It looks and feels like git. As it is written in Perl, it definitely has too many options. That said, what we were shown works, works well, and appears to be a solid and thorough system for the job.

I also met a couple of people from CoinBeyond. They are a point-of-sale software vendor that specializes in letting "regular" people (read: not I or likely the people reading this blog) use Bitcoin!

That's right folks, the hottest young currency in the market today is using the hottest middle-aged technology for their database: PostgreSQL. It was great to see that they are also located in Whatcom County. The longer I am here, the more I am convinced that Whatcom County (and especially Bellingham) is a quiet tech center working on profitable ventures without the noise of places like Silicon Valley. I just keep running into people doing interesting things with technology.

Oh, for reference:

  • Twitter: @coinbeyond
  • Facebook: CoinBeyond
  • LinkedIn: Linkedin

Devrim GÜNDÜZ: Mark your calendars: May 9 2015, PGDay.TR in Istanbul!

The Turkish PostgreSQL Users' and Developer's Association is organizing the 4th PGDay.TR on May 9, 2015 in Istanbul. Dave Page, one of the community leaders, will be giving the keynote.

This year, we are going to have 1 full English track along with 2 Turkish tracks, so if you are close to Istanbul, please join us for a wonderful city, event and fun!

We are also looking for sponsors for this great event. Please email sponsor@postgresql.org.tr for details.

See you in Istanbul!

Conference website: http://pgday.postgresql.org.tr/en/

Hubert 'depesz' Lubaczewski: Waiting for 9.5 – array_offset() and array_offsets()

On 18th of March, Alvaro Herrera committed patch:

array_offset() and array_offsets()

These functions return the offset position or positions of a value in an array.

Author: Pavel Stěhule
Reviewed by: Jim Nasby

It's been a while since my last “waiting for” post – mostly because while there is a lot of work happening, […]

Robert Haas: Parallel Sequential Scan for PostgreSQL 9.5

Amit Kapila and I have been working very hard to make parallel sequential scan ready to commit to PostgreSQL 9.5.  It is not all there yet, but we are making very good progress.  I'm very grateful to everyone in the PostgreSQL community who has helped us with review and testing, and I hope that more people will join the effort.  Getting a feature of this size and complexity completed is obviously a huge undertaking, and a significant amount of work remains to be done.  Not a whole lot of brand-new code remains to be written, I hope, but there are known issues with the existing patches where we need to improve the code, and I'm sure there are also bugs we haven't found yet.
Read more »