Channel: Planet PostgreSQL

Vasilis Ventirozos: Offsite replication problems and how to solve them.

Those of us who use (and abuse) replication on a daily basis know how cool and flexible it is. I've seen a lot of guides on how to set up streaming replication in 5 minutes, and how to set up basic archiving and/or WAL shipping replication, but I haven't seen many guides combining these or implementing an offsite setup simulating latency, packet corruption, and basically what happens under network degradation.

In this post I will describe a resilient two-node replication setup and put it to the test. For this post I will use two Debian VMs, PostgreSQL 9.4 beta2, OmniPITR 1.3.2 and netem.
Netem is available on all current Linux kernels (2.6+) and can emulate variable delay, loss, duplication and re-ordering.

The Basics

Streaming replication is awesome: it's fast, easy to set up, lightweight and near real-time. But how does it perform over the internet?

I set up a simple streaming replica and set checkpoint_segments and wal_keep_segments low (10 and 5). Now I want to emulate how it will perform over a slow internet connection:
From lab2, as root:
# tc qdisc add dev eth0 root tbf rate 20kbit buffer 1600 limit 3000

This will emulate an "almost" network outage, limiting eth0 to 20kbit.
Next, hammer lab1 with transactions... a bit later:
FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000000A has already been removed
Makes sense, right? lab2 couldn't keep up with lab1, lab1 recycled all of its xlogs, and the replica is now broken. I know this example is a bit extreme; these settings would never be used for an offsite replica - in all fairness they aren't even suitable for a local replica.
But! Network outages happen, and especially with geographically distributed databases this WILL happen, because:
Matter will be damaged in direct proportion to its value.

So, let's configure something that will tolerate such failures, and will do much, much more.

Requirements
First of all, we want Postgres to archive, and we want WAL files to be transferred compressed and over an encrypted channel. For all my WAL management I will use OmniPITR, a WAL management suite written by OmniTI that is simple to use, has almost no dependencies and makes everything so much better. I will use rsync over ssh to transfer my WALs. Below are the archive_command and the recovery.conf entries for streaming + WAL shipping replication.
(please keep in mind that these settings are for the sake of this post, they are not to be used directly on production systems)

archive_command = '/home/postgres/omnipitr/bin/omnipitr-archive -D /opt/postgres/pgdata/ -dr gzip=postgres@lab2:/opt/postgres/walarchive -l /home/postgres/archive.log -v  "%p"'           

recovery.conf :
standby_mode = 'on'
primary_conninfo = 'user=postgres host=lab1 port=5432'
restore_command = '/home/postgres/omnipitr/bin/omnipitr-restore -l /home/postgres/restore.log -s gzip=/opt/postgres/walarchive -f /home/postgres/failover.now -p /home/postgres/pause.removal -t /tmp -ep hang -pp /opt/postgres/psql/bin/pg_controldata  -sr -r -v %f %p'

Again, with the same WAL settings, I hammered the database with transactions. The replica started lagging because of the 20kbit limitation, but eventually it caught up and everything was OK.

OmniPITR is hands down awesome, and it can do much more than just archive and restore WALs. You can delay your replica, you can take hot backups from the slave that can even be encrypted on creation, you can send the backup from a slave directly to an offsite backup server with no need for extra disk space on the replica, and more.
(On some occasions directly sending WALs to the slave might create issues; if you get into this situation, remember that you can always archive WALs locally on the master and schedule a script to transfer the generated WALs at intervals that serve your needs.)

This all shows that WAL shipping replication on top of streaming is still very usable when it comes to remote/offsite slaves, and you can always switch to streaming-only or WAL-shipping-only without problems.
After all, it's good to know that in case of a network outage, your only limitation is disk space for archived WALs.

Feedback and ideas for future posts are always welcome :)

Thanks for reading
- Vasilis



Paul Ramsey: PostGIS for Managers

Hans-Juergen Schoenig: Next stop: Joining 1 million tables

This week I started my preparations for one of my talks in Madrid. The topic is: “Joining 1 million tables”. Actually 1 million tables is quite a lot and I am not sure if there is anybody out there who has already tried to do something similar. Basically the idea is to join 1 million […]

Pavel Stehule: plpgsql_check rpm packages are available for PostgreSQL 9.3 for RHEL 7, 6

Quinn Weaver: RDS for Postgres: List of Supported Extensions

Today I learned that Amazon doesn't keep any list of the extensions supported in RDS PostgreSQL. Instead, their documentation tells you to start a psql session and run 'SHOW rds.extensions'. But that creates a chicken-and-egg situation if you have an app that needs extensions, and you're trying to decide whether to migrate.

So here's a list of extensions supported as of today, 2014-09-17 (RDS PostgreSQL 9.3.3). I'll try to keep this current.

btree_gin
btree_gist
chkpass
citext
cube
dblink
dict_int
dict_xsyn
earthdistance
fuzzystrmatch
hstore
intagg
intarray
isn
ltree
pgcrypto
pgrowlocks
pg_trgm
plperl
plpgsql
pltcl
postgis
postgis_tiger_geocoder
postgis_topology
sslinfo
tablefunc
tsearch2
unaccent
uuid-ossp

Chris Travers: PGObject Cookbook Part 2.1: Serialization and Deserialization of Numeric Fields


Preface


This article demonstrates the simplest cases of automatic serialization and deserialization of objects to and from the database in PGObject.  It also demonstrates a minimal subset of the problems that three-valued logic introduces and the most general solutions to those problems.  The next article in this series will address more specific solutions and more complex scenarios.

The Problems


Oftentimes we want database fields automatically turned into object types which are useful to an application.  The example here turns SQL numeric fields into Perl Math::BigFloat objects.  However, the transformation isn't perfect, and if not carefully done it can be lossy.  Most application types don't support database nulls properly, and therefore a NULL making a round trip may end up with an unexpected value if we aren't careful.  Therefore we have to create our type in a way which can make round trips in a proper, lossless way.

NULLs introduce another subtle problem with such mappings, in that object methods are usually not prepared to handle them properly.  One solution here is to try to follow the basic functional programming approach and copy on write.  This prevents a lot of problems.  Most Math::BigFloat operations do not mutate the objects so we are relatively safe there, but we still have to be careful.

The simplest way to address this is to build a basic sensitivity to three-valued logic into one's approach.  However, this poses a number of problems of its own, in that one can accidentally assign a value to a shared object and thereby impact things elsewhere.

A key principle on all our types is that they should handle a null round trip properly for the data type, i.e. a null from the db should be turned into a null on database insert.  We generally allow programmers to check the types for nulls, but don't explicitly handle them with three value logic in the application (that's the programmer's job).

The Example Module and Repository


This article follows the code of PGObject::Type::BigFloat.  The code is licensed under the two-clause BSD license, as is the rest of the PGObject framework.  You can read the code to see the boilerplate; I won't be including it here.  I will note, though, that this extends the Math::BigFloat library, which provides arbitrary-precision arithmetic in Perl and is a good match for the PostgreSQL numeric types LedgerSMB uses.

NULL handling


To solve the problem of null inputs we extend the hashref slightly with a key _pgobject_undef and allow this to be set or checked by applications with a function "is_undef."  This is fairly trivial:

sub is_undef {
    my ($self, $set) = @_;
    $self->{_pgobject_undef} = $set if defined $set;
    return $self->{_pgobject_undef};
}

How PGObject Serializes


When a stored procedure is called, the mapper class calls PGObject::call_procedure with an enumerated set of arguments.  A query is generated to call the procedure, and each argument is checked for a "to_db" method.  That method, if it exists, is called and the output used instead of the argument provided.  This allows an object to specify how it is serialized.

The to_db method may return either a literal value or a hashref with two keys, type and value.  If the latter, the value is used as the value literal and the type is the cast type (i.e. it generates ?::type for the placeholder and binds the value to it).  This hash approach is automatically used when bytea arguments are found.

The code used by PGObject::Type::BigFloat is simple:

sub to_db {
    my $self = shift @_;
    return undef if $self->is_undef;
    return $self->bstr;
}

Any type of course can specify a to_db method for serialization purposes.

How and When PGObject Deserializes


Unlike serialization, deserialization from the database can't happen automatically without the developer specifying which database types correspond to which application classes, because multiple types could serialize into the same application classes.  We might even want different portions of an application (for example in a database migration tool) to handle these differently.

For this reason, PGObject has what is called a "type registry" which specifies which types are deserialized and as what.  The type registry is optionally segmented into several "registries" but most uses will in fact simply use the default registry and assume the whole application wants to use the same mappings.  If a registry is not specified the default subregistry is used and that is consistent throughout the framework.

Registering a type is fairly straightforward but mostly amounts to boilerplate code in both the type handler and the using scripts.  For this type handler:

sub register{
    my $self = shift @_;
    croak "Can't pass reference to register \n".
          "Hint: use the class instead of the object" if ref $self;
    my %args = @_;
    my $registry = $args{registry};
    $registry ||= 'default';
    my $types = $args{types};
    $types = ['float4', 'float8', 'numeric'] unless defined $types and @$types;
    for my $type (@$types){
        my $ret =
            PGObject->register_type(registry => $registry, pg_type => $type,
                                  perl_class => $self);
        return $ret unless $ret;
    }
    return 1;
}

Then we can just call this in another script as:

PGObject::Type::BigFloat->register;

Or we can specify a subset of types or different types, or the like.

The deserialization logic is handled by a method called 'from_db' which takes in the database literal and returns the blessed object.  In this case:

sub from_db {
    my ($self, $value) = @_;
    my $obj = "$self"->new($value);
    $obj->is_undef(1) if ! defined $value;
    return $obj;
}

This supports subclassing, which is in fact the major use case.

Use Cases


This module is used as the database interface for numeric types in the LedgerSMB 1.5 codebase.  We subclass this module and add support for localized input and output (with different decimal and thousands separators).  This gives us a data type which can present itself to the user in one format and to the database in another.  The module could be further subclassed to make nulls contagious (which in this module they are not) and the like.

Caveats


PGObject::Type::BigFloat does not currently make null handling contagious, and this module as such probably never will, as this is part of our philosophy of handing control to the programmer.  Those who do want contagious nulls can override additional methods from Math::BigFloat to provide that in subclasses.

A single null can go from the db into the application, return to the db, and be serialized as a null, but a running total of nulls will be saved in the db as a 0.  As far as it goes, that behavior is probably correct.  More specific handling of nulls in the application, however, is left to the developer, who can check the is_undef method.

Next In Series:  Advanced Serialization and Deserialization:  Dates, Times, and JSON

Denish Patel: Postgres in Amazon RDS

Jeff Frost: WAL-E with Rackspace CloudFiles over Servicenet

Found a great walkthrough on setting up WAL-E to use python-swiftclient for storage in Rackspace Cloud Files: https://developer.rackspace.com/blog/postgresql-plus-wal-e-plus-cloudfiles-equals-awesome/

Unfortunately by default, your backups use the public URL for Cloud Files and eat into metered public bandwidth.

The way to work around this is to set the endpoint_type to internalURL instead of the default publicURL.

You do that by setting the following environment variable:

SWIFT_ENDPOINT_TYPE='internalURL'

That allows WAL-E to use Servicenet for base backups and WAL archiving which will be much faster and not eat into your metered public bandwidth.

Joshua Drake: Along the lines of GCE, here are some prices

I was doing some research for a customer who wanted to know where the real price-to-performance value is. Here are some pricing structures for GCE, AWS and SoftLayer. For comparison, note that SoftLayer is bare metal while the others are virtual.

GCE: 670.00
16 CPUS
60G Memory
2500GB HD space

GCE: 763.08
16 CPUS
104G Memory
2500GB HD space

Amazon: 911.88
16 CPUS
30G Memory
3000GB HD Space

Amazon: 1534.00
r3.4xlarge
16 CPUS
122G Memory
SSD 1 x 320
3000GB HD Space

Amazon: 1679.00
c3.8xlarge
32 CPUS
60G Memory
SSD 2 x 320
3000GB HD Space

None of the above include egress bandwidth charges. Ingress is free.

Softlayer: ~815 (with 72GB memory ~ 950)
16 Cores
RAID 10
4TB (4 2TB drives)
48GB Memory

Softlayer: ~1035 (with 72GB memory ~ 1150)
16 Cores
RAID 10
3TB (6 1TB drives, I also looked at 8-750GB and the price was the same. Lastly I also looked at using 2TB drives but the cost is all about the same)
48GB Memory

Tomas Vondra: Introduction to MemoryContexts


If I had to name one thing that surprised me the most back when I started messing with C and PostgreSQL, I'd probably name memory contexts. I had never met this concept before, so it seemed rather strange, and there's not much documentation introducing it. I recently read an interesting paper summarizing the architecture of a database system (by Hellerstein, Stonebraker and Hamilton), and it actually devotes a whole section (7.2 Memory Allocator) to memory contexts (aka allocators). The section explicitly mentions PostgreSQL as having a fairly sophisticated allocator, but sadly it's very short (only ~2 pages) and describes only the general ideas, without discussing the code and challenges - which is understandable, because there are many possible implementations. BTW the paper is very nice, I definitely recommend reading it.

But this blog is a good place to present the details of PostgreSQL memory contexts, including the issues you'll face when using them. If you're a seasoned PostgreSQL hacker, chances are you know all of this (feel free to point out any inaccuracies), but if you're just starting to hack PostgreSQL in C, this blog post might be useful for you.

Now, when I said there's not much documentation about memory contexts, I was lying a bit. There are plenty of comments in memutils.h and aset.c, explaining the internals quite well - but who reads code comments, right? Also, you can only read them once you realize how important memory contexts are (and find the appropriate files). Another issue is that the comments only explain "how it works" and not some of the consequences (like palloc overhead, for example).

Motivation

But why do we even need memory contexts? In C, you simply call malloc whenever you need to allocate memory on the heap, and when you're done with the memory, you call free. It's simple, and for short programs this is pretty sufficient and manageable, but as the program gets more complex (passing allocated pieces between functions) it becomes really difficult to track all those little pieces of memory. Memory allocated at one place may be passed around and then freed in a completely different part of the code, far far away from the malloc that allocated it. If you free a piece too early, the application will eventually see garbage; if you free it too late (or never), you get excessive memory usage (or memory leaks).

And PostgreSQL is quite complex code - consider for example how tuples flow through execution plans. A tuple is allocated in one place, gets passed through sorting, aggregations, various transformations etc., and is eventually sent to the client.

Memory contexts are a clever way to deal with this - instead of tracking each little piece of memory separately, each piece is registered somewhere (in a context), and then the whole context is released at once. All you have to do is choose the memory context, and call palloc/pfree instead of malloc/free.

In the simplest case palloc simply determines the "current" memory context (more on this later), allocates an appropriate piece of memory (by calling malloc), associates it with the current memory context (by storing a pointer in the memory context and some info in a "header" of the allocated piece), and returns it to the caller. Freeing the memory is done either through pfree (which reverses the palloc logic) or by freeing the whole memory context (you can see it as a pfree loop over all allocated pieces).
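
To make this concrete, here is a minimal sketch of the typical pattern as seen from backend or extension C code. The function name, the context name and the use of the default allocation-set sizes are just illustrative; the APIs used (AllocSetContextCreate, MemoryContextSwitchTo, palloc, MemoryContextDelete) are the standard ones discussed in this post:

#include "postgres.h"
#include "utils/memutils.h"

/* A minimal sketch (not from the PostgreSQL sources): do some work in a
 * private memory context, then release everything it allocated at once. */
static void
process_batch(void)
{
    MemoryContext batchcxt;
    MemoryContext oldcxt;
    char       *buf;
    int        *ids;

    /* create a child of the current context - name and sizes are illustrative */
    batchcxt = AllocSetContextCreate(CurrentMemoryContext,
                                     "BatchContext",
                                     ALLOCSET_DEFAULT_MINSIZE,
                                     ALLOCSET_DEFAULT_INITSIZE,
                                     ALLOCSET_DEFAULT_MAXSIZE);
    oldcxt = MemoryContextSwitchTo(batchcxt);

    buf = palloc(128);                  /* registered in batchcxt */
    ids = palloc0(100 * sizeof(int));   /* zeroed, also in batchcxt */

    /* ... use buf and ids ... */

    MemoryContextSwitchTo(oldcxt);

    /* frees buf, ids and anything else allocated in batchcxt in one go */
    MemoryContextDelete(batchcxt);
}

An explicit pfree(buf) would work too, but as described above it is often unnecessary - deleting (or resetting) the context releases everything allocated in it.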

This offers multiple optimization options - for example reducing the number of malloc/free calls by keeping a cache of released pieces, etc.

The other thing is granularity and organization of memory contexts. We certainly don't want a single huge memory context, because that's almost exactly the same as having no contexts at all. So we know we need multiple contexts, but how many? Luckily, there's a quite natural way to split memory contexts, because all queries are evaluated through execution plans - a tree of operators (scans, joins, aggregations, ...).

Most executor nodes have their own memory context, released once that particular node completes. So for example when you have a join or aggregation, once this step finishes (and passes all the results to the downstream operator), it discards the context and frees the memory it allocated (and didn't free explicitly). Sometimes this is not perfectly accurate (e.g. some nodes create multiple separate memory contexts), but you get the idea.

The link to execution plans also gives us a hint on how to organize the memory contexts - the execution plan is a tree of nodes, and with memory contexts attached to nodes, it's natural to keep the memory contexts organized in a tree too.

That being said, it's worth mentioning that memory contexts are not used only when executing queries - pretty much everything in PostgreSQL is allocated within some memory context, including "global" structures like the various caches etc. That however does not contradict the tree-ish structure and per-node granularity.

MemoryContext API

The first thing you should probably get familiar with is the MemoryContextMethods API, which provides generic infrastructure for various possible implementations. It more or less captures the ideas outlined above. The memory context itself is defined as a simple structure:

typedef struct MemoryContextData
{
    NodeTag     type;
    MemoryContextMethods *methods;
    MemoryContext parent;
    MemoryContext firstchild;
    MemoryContext nextchild;
    char       *name;
    bool        isReset;
} MemoryContextData;

Which allows the tree structure of memory contexts (by parent and first/next child fields). The methods describe what "operations" are available for a context:

typedef struct MemoryContextMethods
{
    void       *(*alloc) (MemoryContext context, Size size);
    /* call this free_p in case someone #define's free() */
    void        (*free_p) (MemoryContext context, void *pointer);
    void       *(*realloc) (MemoryContext context, void *pointer, Size size);
    void        (*init) (MemoryContext context);
    void        (*reset) (MemoryContext context);
    void        (*delete_context) (MemoryContext context);
    Size        (*get_chunk_space) (MemoryContext context, void *pointer);
    bool        (*is_empty) (MemoryContext context);
    void        (*stats) (MemoryContext context, int level);
#ifdef MEMORY_CONTEXT_CHECKING
    void        (*check) (MemoryContext context);
#endif
} MemoryContextMethods;

Which pretty much says that each memory context implementation provides methods to allocate, free and reallocate memory (alternatives to malloc, free and realloc) and also methods to manage the contexts (e.g. initialize a new context, destroy it etc.).

There are also several helper methods wrapping this API, forwarding the calls to the proper instance of MemoryContextMethods.

And when I mentioned palloc and pfree before - these are pretty much just additional wrappers on top of these helper methods (grabbing the current context and passing it into the method).
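
As a rough sketch (simplified, not the literal code from mcxt.c), the palloc wrapper boils down to something like this - pfree works analogously, except that it finds the owning context through the header stored in front of the allocated piece:

/* simplified sketch of the dispatch - the real implementation lives in mcxt.c */
void *
palloc(Size size)
{
    MemoryContext context = CurrentMemoryContext;

    /* forward to whatever implementation this context uses (AllocSet today) */
    return (*context->methods->alloc) (context, size);
}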

Allocation Set (AllocSet) Allocator

Clearly, the MemoryContext API provides just the infrastructure, and was developed in anticipation of multiple allocators with different features. That however never happened, and so far there's a single memory context implementation - Allocation Set.

This often makes the discussion a bit confusing, because people mix the general concept of memory contexts and the (single) implementation available.

The Allocation Set implementation is quite sophisticated (aka complex). Let me quote the first comment in aset.c:

... it manages allocations in a block pool by itself, combining many small allocations in a few bigger blocks. AllocSetFree() normally doesn't free() memory really. It just add's the free'd area to some list for later reuse by AllocSetAlloc(). All memory blocks are free()'d at once on AllocSetReset(), which happens when the memory context gets destroyed.

To explain this a bit - AllocSet allocates blocks of memory (multiples of 1kB), and then "splits" this memory into smaller chunks, to satisfy the actual palloc requests. When you free a chunk (by calling pfree), it can't immediately pass it to free because the memory was allocated as a part of a larger block. So it keeps the chunk for reuse (for similarly-sized palloc requests), which has the nice benefit of lowering the number of malloc calls (and generally malloc-related book-keeping).

This works perfectly when you have palloc calls with a mix of different request sizes, but once you break this assumption, the results are pretty bad. Similarly, it's possible to construct request patterns that interact badly with the logic grouping requests into size classes (which is what makes chunk reuse easy), resulting in a lot of wasted memory.

There's another optimization for requests over 8kB, which are handled differently - the largest chunks served from the block pool are 8kB, and all requests exceeding this are allocated through malloc directly (in a dedicated block), and freed immediately using free.

The CurrentMemoryContext

Now, let's say you call palloc, which looks almost exactly the same as a malloc call:

char *x = palloc(128);    // allocate 128B in the context

So how does it know which memory context to use? It's really simple - the memory context implementation defines a few global variables, tracking interesting memory contexts, and one of them is CurrentMemoryContext which means "we're currently allocating memory in this context."

Earlier I mentioned that each execution node has an associated context - so the first thing an executor node may do is set its associated memory context as the current one. This however is a problem, because the child nodes may do the same, and the execution may be "interleaved" (the nodes are passing tuples in an iterative manner).

Thus what we usually see is this idiom:

MemoryContext oldcontext = MemoryContextSwitchTo(nodecontext);

char *x = palloc(128);
char *y = palloc(256);

MemoryContextSwitchTo(oldcontext);

which restores the current memory context to the original value once the allocations are done.

Summary

I tried to explain the motivation and basics of memory contexts, and hopefully directed you to the proper source files for more info.

The main points to remember are probably:

  • Memory contexts group allocated pieces of memory, making it easier to manage lifecycle.
  • Memory contexts are organized in a tree, roughly matching the execution plans.
  • There's a generic infrastructure allowing different implementations, but nowadays there's a single implementation - Allocation Set.
  • It attempts to minimize malloc calls/book-keeping, maximize memory reuse, and never really frees memory.

In the next post I'll look into the usual problems with palloc overhead.

gabrielle roth: Happy 10th Birthday, Portal!


The Portal Project hosted at PSU is near & dear to our hearts here at PDXPUG. (It’s backed by an almost 3TB Postgres database.) We’ve had several talks about this project over the years:

eXtreme Database Makeover (Episode 2): PORTAL – Kristin Tufte
Metro Simulation Database – Jim Cser
R and Postgres – Chris Monsere (I think this is where we first heard about bus bunching)
Extreme Database Makeover – Portal Edition – William van Hevelingin

Kristin Tufte most recently spoke at the PDXPUG PgDay about current development on this project, which is now in its 10th year. Future plans include data from more rural locations, more detailed bus stats, and possibly a new bikeshare program. We look forward to hearing more about it!


Josh Berkus: Finding Duplicate Indexes

Recently a client asked us to help them find and weed out duplicate indexes.  We had some old queries to do this, but they tended to produce a lot of false positives, and in a database with over 2000 indexes that wasn't going to cut it.  So I rewrote those queries to make them a bit more intelligent and discriminating, and to supply more information to the user on which to base decisions about whether to drop an index.

Here's the first query, which selects only indexes which have exactly the same columns.  Let me explain the columns of output it produces:
  • schema_name, table_name, index_name: the obvious
  • index_cols: a comma-delimited list of index columns
  • indexdef: a CREATE statement for how the index was created, per pg_indexes view
  • index_scans: the number of scans on this index per pg_stat_user_indexes
Now, go run it on your own databases.  I'll wait.

So, you probably noticed that we still get some false positives, yes?  That's because an index can have all the same columns but still be different.  For example, it could use varchar_pattern_ops, GiST, or be a partial index.  However, we want to see those because often they are functionally duplicates of other indexes even though they are not exactly the same.  For example, you probably don't need both an index on ( status WHERE cancelled is null ) and on ( status ).

What about indexes which contain all of the columns of another index, plus some more?  Like if you have one index on (id, name) you probably don't need another index on just (id).  Well, here's a query to find partial matches.

This second query looks for indexes where one index contains all of the same columns as a second index, plus some more, and they both share the same first column.  While a lot of these indexes might not actually be duplicates, a lot of them will be.

Obviously, you could come up with other variations on this, for example searching for all multicolumn indexes with the same columns in a different order, or indexes with the same first two columns but others different.  To create your own variations, the key is to edit the filter criteria contained in this clause:

WHERE EXISTS ( SELECT 1
    FROM pg_index as ind2
    WHERE ind.indrelid = ind2.indrelid
    AND ( ind.indkey @> ind2.indkey
     OR ind.indkey <@ ind2.indkey )
    AND ind.indkey[0] = ind2.indkey[0]
    AND ind.indkey <> ind2.indkey
    AND ind.indexrelid <> ind2.indexrelid
)


... and change it to figure out the factors which give you the most real duplicates without missing anything.

Happy duplicate-hunting!

Paul Ramsey: PostGIS Feature Frenzy

Steve Singer: JSON in Postgres – Toronto Postgres Users Group


Tonight I presented a talk on using JSON in Postgres at the Toronto Postgres users group. Pivotal hosted the talk at their lovely downtown Toronto office. Turnout was good with a little over 15 people attending (not including the construction workers banging against some nearby windows).

I talked about the JSON and JSONB datatypes in Postgres and some ideas for appropriate uses of NoSQL features in a SQL database like Postgres.

My slides are available for download

We are thinking of having lightning and ignite talks for the next meetup. If anyone is in the Toronto area and wants to give a short (5 minute) talk on a Postgres-related topic, let me know.


gabrielle roth: PgOpen 2014 – quick recap

Many thanks to the speakers, my fellow conference committee members, and especially our chair, Kris Pennella, for organizing the best PgOpen yet. (Speakers: please upload your slides or a link to your slides to the wiki.) I came back with a big to-do/to-try list: check out Catherine Devlin’s DDL generator, familiarize myself with the FILTER […]

Tomas Vondra: Allocation Set internals


Last week I explained (or attempted to) the basics of memory contexts in PostgreSQL. It was mostly about the motivation behind the concept of memory contexts, and some basic overview of how it all works together in PostgreSQL.

I planned to write a follow-up post about various "gotchas" related to memory contexts, but as I was writing that post I ended up explaining more and more details about the internals of AllocSet (the only memory context implementation in PostgreSQL). So I've decided to split that post into two parts - the first one (which you're reading) explains the internals of allocation sets. The next post will finally deal with the gotchas.

The level of detail should be sufficient for understanding the main principles (and some tricks) used in the AllocSet allocator. I won't explain all the subtle details - if you're interested in that (and the code is quite readable, if you understand the purpose), please consult the actual code in aset.c. Actually, it might be useful to keep that file opened and read the related functions as I explain what palloc and pfree do (at the end of this post).

Blocks

The introductory post mentioned that Allocation Sets are based on "block pools" - that means the allocator requests memory in large blocks (using malloc), and then slices these blocks into smaller pieces to handle palloc requests.

The block sizes are somewhat configurable, but by default the allocator starts with 8kB blocks, and every time it needs another block it doubles the size, up to 8MB (see the constants at the end of memutils.h). So it first allocates an 8kB block, then 16kB, 32kB, 64kB, ... 8MB (and then keeps allocating 8MB blocks). You may choose different sizes, but the minimum allowed block size is 1kB, and the block sizes are always 2^N bytes.
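
As a quick standalone illustration of that growth rule (plain C, not PostgreSQL code - the two constants simply mirror the 8kB/8MB defaults mentioned above):

#include <stdio.h>

/* print the default block growth sequence: 8kB doubling up to 8MB */
int
main(void)
{
    size_t      init_size = 8 * 1024;           /* default initial block size */
    size_t      max_size = 8 * 1024 * 1024;     /* default maximum block size */

    for (size_t size = init_size; ; size *= 2)
    {
        printf("%zu kB\n", size / 1024);
        if (size >= max_size)
            break;                  /* all further blocks stay at the maximum */
    }

    return 0;
}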

To make the management easier, each block is decorated with a header, represented by this structure defined in aset.c:

typedef struct AllocBlockData
{
    AllocSet    aset;       /* aset that owns this block */
    AllocBlock  next;       /* next block in aset's blocks list */
    char       *freeptr;    /* start of free space in this block */
    char       *endptr;     /* end of space in this block */
} AllocBlockData;

Each block references the memory context it's part of, has a pointer to the next block (forming a linked list of allocated blocks), and pointers delimiting the free space in the block. The header is stored at the beginning of the block - AllocBlockData is ~32B if I count correctly, so this overhead is negligible even for the smallest block size (~3%).

Chunks

So the allocator acquires memory in blocks, but how does it provide it to the palloc callers? That's what chunks are about ...

Chunks are pieces of memory allocated within the blocks in response to palloc calls. Just like blocks, each chunk is decorated with a header (this is slightly simplified, compared to the actual structure definition in aset.c):

typedef struct AllocChunkData
{
    void       *aset;       /* the owning aset */
    Size        size;       /* usable space in the chunk */
} AllocChunkData;

Each chunk mentions which memory context it belongs to (just like blocks), and how much space is available in it.

Just like block sizes, the chunk sizes are not entirely arbitrary - sure, you may request an arbitrary amount of memory (using palloc), but the amount of memory reserved in the block will always be 2^N bytes (plus the AllocChunk header, which is usually 16B on 64-bit platforms).

For example, if you request 24 bytes (by calling palloc(24)), the memory context will actually create a 32B chunk (because 32 is the nearest power of 2 above 24), and hand it back to you. So you will (rather unknowingly) get 32B of usable space. The smallest allowed chunk is 8B, so all smaller requests get an 8B chunk.
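
Here is a small standalone sketch of that rounding rule (illustrative only, not the actual aset.c code), including the ~16B chunk header mentioned above:

#include <stdio.h>

/* round a request up to the chunk sizes described above: a power of 2,
 * with a minimum of 8 bytes; 16B is the assumed per-chunk header overhead */
static size_t
chunk_size(size_t request)
{
    size_t      size = 8;           /* smallest chunk */

    while (size < request)
        size *= 2;                  /* next power of 2 */

    return size;
}

int
main(void)
{
    size_t      requests[] = {5, 8, 24, 100, 1000};

    for (int i = 0; i < 5; i++)
    {
        size_t      usable = chunk_size(requests[i]);

        printf("palloc(%zu) -> %zu B usable, %zu B occupied\n",
               requests[i], usable, usable + 16);
    }

    return 0;
}

So palloc(24) ends up with 32B of usable space occupying 48B, which is exactly the case discussed in the next section.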

Chunk overhead

Chunks are the main source of overhead in memory contexts.

Firstly, because of the 2^N sizes, there'll usually be some unused space. The bottom 1/2 of the chunk is always utilized (otherwise a smaller chunk would be used), and the upper half is 50% utilized on average, assuming a random distribution of requested sizes. Thus ~75% utilization and 25% overhead on average. That's not negligible, but it's probably a fair price for the features memory contexts provide (and compare it to overhead in garbage-collected languages, for example).

Secondly, there's the chunk header. A chunk with 32B usable space will actually occupy 48B. The impact of this really depends on the request size. For large requests it gets negligible (similarly to the block header), but as you can imagine for very small requests it gets very high (e.g. 200% for 8B, which is the smallest allowed request).

Chunk reuse

But why are the chunk sizes treated like this? The answer is "reuse." When you call pfree, the chunk is not actually returned to the system (using free). It actually can't be, because it's only a small part of the block (which is what we got from malloc), and there may be more chunks on it. So we can't just free the chunk or block.

Instead, the chunk is moved to a "freelist" - a list of free chunks - and may be used for future palloc calls requesting a chunk of the same size. There's one freelist for each chunk size, so that we can easily get a chunk of the "right" size (if there's one). Without these "unified" chunk sizes this would not really be possible - the size matching would be more expensive or less efficient.
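
To illustrate the mechanism, here is a toy, self-contained version of the per-size-class freelist idea (deliberately simplified - the real allocator carves chunks out of blocks, stores the size in the chunk header instead of taking it as a parameter, and handles requests over 8kB separately, as described below):

#include <stdlib.h>

#define NUM_CLASSES 11              /* size classes 8B, 16B, ..., 8kB */

typedef struct ToyChunk
{
    struct ToyChunk *next;          /* link used while parked in a freelist */
} ToyChunk;

static ToyChunk *freelist[NUM_CLASSES];

/* map a request to its size class: 8B -> 0, 16B -> 1, ..., 8kB -> 10 */
static int
size_class(size_t request)
{
    int         idx = 0;
    size_t      size = 8;

    while (size < request)
    {
        size *= 2;
        idx++;
    }
    return idx;
}

static void *
toy_alloc(size_t request)
{
    int         idx = size_class(request);

    if (freelist[idx] != NULL)      /* reuse a previously freed chunk */
    {
        ToyChunk   *chunk = freelist[idx];

        freelist[idx] = chunk->next;
        return chunk;
    }
    /* no reusable chunk - the real allocator would carve one off a block */
    return malloc((size_t) 8 << idx);
}

static void
toy_free(void *pointer, size_t request)
{
    int         idx = size_class(request);
    ToyChunk   *chunk = (ToyChunk *) pointer;

    /* nothing is returned to the OS - the chunk is just parked for reuse */
    chunk->next = freelist[idx];
    freelist[idx] = chunk;
}

int
main(void)
{
    void       *a = toy_alloc(24);          /* 32B size class */

    toy_free(a, 24);                        /* parked in the 32B freelist */
    return (toy_alloc(30) == a) ? 0 : 1;    /* same class, so the chunk is reused */
}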

Still, there are ways this can fail tremendously (e.g. when the assumption of random distribution of sizes does not hold, or when the reuse does not work for some reason). More on these failures in the next post.

Oversized chunks

If there's a maximum block size, what happens to requests exceeding this limit? The current implementation creates a special one-chunk block for each such oversized chunk, with almost exactly the right size (no 2^N rules, just some basic alignment tweaks). The purpose of the block pool is to prevent overhead with allocating too many tiny pieces, so allocating large blocks directly is not a big deal (before you allocate enough of them to notice the overhead, you'll likely run out of memory).

More precisely - oversized chunks are not chunks exceeding max block size, but chunks exceeding 8kB (because the number of freelists is limited) or exceeding 1/8 of max block size (so that a stream of such chunks results in 1/8 overhead).

Anyway, the thing you should remember is that large chunks are special-cased - they are allocated in dedicated blocks of just the right size, and pfree immediately frees them.

Now, let's see what the two main methods - palloc and pfree do.

palloc (AllocSetAlloc)

Assuming CurrentMemoryContext is set to an instance of the allocation set allocator (and it's difficult to break this assumption, because there are no other implementations), the flow after calling palloc (see mcxt.c for the implementation) looks roughly like this:

palloc(size)
    -> CurrentMemoryContext->alloc(CurrentMemoryContext, size)
        -> AllocSetAlloc(CurrentMemoryContext, size)

That is, palloc looks at CurrentMemoryContext and invokes the alloc method (which is part of the MemoryContextMethods API, described in the previous post). As it's an AllocSet instance, this invokes AllocSetAlloc, which then does all the work with blocks and chunks.

In short, it does about this:

  • Checks whether the size is exceeding the "oversized chunk" limit - in that case, a dedicated block (of just the right size) is allocated.
  • Checks whether there's an existing chunk in the freelist with the proper size, and if yes then uses it.
  • Checks whether there already is a block, and if there's enough space in it (that's what the block header is for). A new block is allocated (using malloc) if necessary.
  • Reserves sufficient space in the block for the chunk, and returns it to the caller.

There are a few more tricks (e.g. what would you do with the remaining space in a block?) - the code is easy to read and quite well commented, although rather long. Not sure what happened to the "Every function should fit on a screen" recommendation - either they had extremely large screens, or a small font size ;-)

pfree (AllocSetFree)

pfree does not need to consult CurrentMemoryContext - and it can't, because the chunk could have been allocated in a completely different context. But hey, that's what the chunk header is for - there's a pointer to the proper AllocSet instance. So the flow looks roughly like this:

pfree(chunk)
    header = chunk - 16B
    -> header->aset->free(header->aset, header)
        -> AllocSetFree(header->aset, header)

And AllocSetFree doesn't do much:

  • Check if this is an "oversized" chunk (those can and should be freed immediately).
  • If it's a regular chunk, move it to a freelist, so that it can be reused by subsequent palloc calls.

Summary

  • There's a single memory context implementation - Allocation Set.
  • It requests memory in "large" blocks, with configurable min/max sizes. The smallest block size is 1kB, and the sizes follow the 2^N rule.
  • The blocks are kept until the whole context is destroyed.
  • The blocks are then carved into chunks, used to satisfy palloc requests.
  • Both chunks and blocks have a header linking them to the memory context.
  • Chunk sizes follow the 2^N rule too; palloc requests are satisfied using the smallest sufficient chunk size.
  • Chunks are not freed, but moved to a freelist for reuse by succeeding palloc calls.
  • Large requests (over 8kB or 1/8 of max block size) are handled using dedicated single-chunk blocks.

Josh Berkus: Settings for a fast pg_restore

One thing which we do a lot for clients is moving databases from one server to another via pg_dump and pg_restore.  Since this process often occurs during a downtime, it's critical to do the pg_dump and pg_restore as quickly as possible.  Here are a few tips:
  • Use the -j multiprocess option for pg_restore (and, on 9.3, for pg_dump as well).  Ideal concurrency is generally two less than the number of cores you have, up to a limit of 8.  Users with many ( > 1000) tables will benefit from even higher levels of concurrency.
  • Doing a compressed pg_dump, copying it (with speed options), and restoring on the remote server is usually faster than piping output unless you have a very fast network.
  • If you're using binary replication, it's faster to disable it while restoring a large database, and then reclone the replicas from the new database.  Assuming there aren't other databases on the system in replication, of course.
  • You should set some postgresql.conf options for fast restore.
"What postgresql.conf options should I set for fast restore?" you ask?  Well, let me add a few caveats first:
  • The below assumes that the restored database will be the only database running on the target system; they are not safe settings for production databases.
  • It assumes that if the pg_restore fails you're going to delete the target database and start over.
  • These settings will break replication as well as PITR backup.
  • These settings will require a restart of PostgreSQL to get to production settings afterwards.
shared_buffers = 1/2 of what you'd usually set
maintenance_work_mem = 1GB-2GB
fsync = off
synchronous_commit = off
wal_level = minimal
full_page_writes = off
wal_buffers = 64MB
checkpoint_segments = 256 or higher
max_wal_senders = 0
wal_keep_segments = 0
archive_mode = off
autovacuum = off
all activity logging settings disabled

Some more notes:
  • you want to set maintenance_work_mem as high as possible, up to 2GB, for building new indexes.  However, since we're doing concurrent restore, you don't want to get carried away; your limit should be (RAM/(2*concurrency)), in order to maintain somewhat of an FS buffer.  This is a reason why you might turn concurrency down, if you have only a few large tables in the database.
  • checkpoint_segments should be set high, but requires available disk space, at the rate of 1GB per 32 segments.  This is in addition to the space you need for the database.
Have fun!



Leo Hsu and Regina Obe: FOSS4G and PGOpen 2014 presentations


At FOSS4G we gave two presentations. The videos from the other presentations are in the FOSS4G 2014 album. I have to commend the organizers for gathering such a rich collection of videos. Most of the videos turned out very well and are also downloadable as MP4 in various resolutions. It really was a treat being able to watch all I missed. I think there are still some videos that will be uploaded soon. As mentioned, there were lots of PostGIS/PostgreSQL talks (or at least people using PostGIS/PostgreSQL in GIS).


Continue reading "FOSS4G and PGOpen 2014 presentations"

Craig Ringer: Compiling and debugging PostgreSQL’s PgJDBC under Eclipse


I’ve always worked on PgJDBC, the JDBC Type 4 driver for PostgreSQL, with just a terminal, ant and vim. I recently had occasion to do some PgJDBC debugging work on Windows specifics so I set up Eclipse to avoid having to work on the Windows command prompt.

As the process isn’t completely obvious, here’s how to set up Eclipse Luna to work with the PgJDBC sources.


If you don’t have it already, download JDK 1.8 and Eclipse Luna.

Now download JDK 1.6 from Oracle (or install the appropriate OpenJDK). You’ll need an older JDK because the JDK doesn’t offer a way to mask out new classes and interfaces from the standard library when compiling for backward compatibility – so you may unwittingly use Java 7- or Java 8-only classes unless you compile against an actual Java 6 JDK.

(If you want to work with the JDBC3 driver you need Java 5 instead, but that’s getting seriously obsolete now).

Now it’s time to import the sources and configure Eclipse so it knows how to work with them:

  1. Register JDK 1.6 with Eclipse:
    • Window -> Preferences, Java, Installed JREs, Add…
    • Set JRE Name to “JDK 1.6″
    • Set JRE Home to the install directory of the JDK – not the contained JRE. It’ll be a directory named something like jdk1.6.0_45 and it’ll have a jre subdirectory. Select the jdk1.6.0_45 directory not the jre subdir.
    • Finish…
    • and OK the preferences dialog.

  2. Import from git and create a project:
    • File -> Import, Git, Projects from Git, then follow the prompts for the repo
    • After the clone completes, Eclipse will prompt you to “Select a wizard for importing projects from git”. Choose Import as a General Project
    • Leave the project name as pgjdbc and hit Finish
  3. Configure ant
    • Get properties on the “pgjdbc” project
    • Go to the Builders tab
    • Press New…
    • In the resulting dialog’s Main tab set the Buildfile to the build.xml in the root of the PgJDBC checkout and set the base directory to the checkout root, e.g. buildfile = ${workspace_loc:/pgjdbc/build.xml} and base directory = ${workspace_loc:/pgjdbc}
    • In the Classpath tab, Add… the maven-ant-tasks jar, lib/maven-ant-tasks-2.1.3.jar from the PgJDBC lib dir
    • In the JRE tab, choose Separate JRE and then select the JDK 1.6 entry you created earlier
  4. Build the project – Project -> Build Project

Done! You can now modify and debug PgJDBC from within Eclipse, including setting breakpoints in your PgJDBC project and having them trap when debugging another project that’s using the same jar.

It isn’t completely automagic like with a native Eclipse project, though. To debug the driver from another project, you’ll need to add the driver from the jars/ directory produced by the build to your other project’s build path. It’ll be named something like postgresql-9.4-1200.jdbc4.jar. Add it under the project’s Properties, in Java Build Path -> Libraries -> Add External JARs.

Once you’ve added the driver JAR you can use your modified JAR, but breakpoints you set in the driver sources won’t get tripped when debugging another project. To enable that, tab open the newly added driver JAR in the Libraries tab and edit “Source attachment…”. Choose “Workspace Location” and, when prompted, your PgJDBC driver project.

The project sources are now linked. You can set breakpoints either by exploring the PgJDBC jar from the project that uses PgJDBC to find the package and source file required, or by opening the source file in the PgJDBC project. Either way it’ll work.

You’ll probably also want to set the loglevel=2 parameter to PgJDBC so that it emits trace logging information.

John Graber: I Shall Return... to Postgres Open

Last week, I had the opportunity to attend my first Postgres conference, that being +Postgres Open 2014.  Two days packed with a ridiculous amount of useful information and fantastic Postgres community interaction.  And now that the conference & sponsor t-shirts are washed, many new social media connections have been made, and the backlog of work email has been addressed —

Many Thanks

First off, the program committee deserves a veritable heap of congratulations and gratitude piled upon them from the attendees.  From the quality of the speakers to the well planned (and orchestrated) schedule, and even to the menu — everything was top notch. And therefore, I am calling you out by name, +Kris Pennella, +Gabrielle R, +Cindy Wise, +Stephen Frost, +Jonathan Katz, & +Selena Deckelmann.  Thanks for your time and effort!

The Talks

As with any (great) conference, the difficult part is the inability to attend multiple presentations at the same time, and thus being forced to choose.  The schedule this year offered no shortage of these situations.  But when your choices include community figures of the likes of +Bruce Momjian presenting on "Explaining the Postgres Query Optimizer" and +Simon Riggs with "How VACUUM works, and what to do when it doesn't", you simply can't go wrong.  Nonetheless, I wish I hadn't had to miss out on the talks given by +denish patel, +Álvaro Hernández Tortosa, +Gurjeet Singh, and +Vibhor Kumar, to name but a few.

Fortunately, the talks were recorded and will be uploaded to the Postgres Open YouTube channel in about a month or so.  In the meantime, the presenters' slides are available on the Postgres Open wiki.


Not Just for DBAs

I particularly want to call +Dimitri Fontaine's talk, "PostgreSQL for developers", to the attention of the developer crowd, as an excellent example of why these conferences are not just for DBAs.  As a consultant, on virtually every job I see examples of poorly written SQL as the root cause of a poorly performing database (even after proper tuning of config parameters).  And therefore, Dimitri's admonition that you must treat SQL just as you would any other language in which you write code rings particularly true with me.  I tweeted it once, and I'll say it again — this talk could lead to peace and love between developers and DBAs.  (Not to mention that the techniques he teaches in the examples walked through will blow your mind.)

Milwaukee's Best

On a related note, I'd like to point out that +pgMKE, Milwaukee's Postgres User Group, was well represented in Chicago.  +Jeff Amiel's talk on "Row-Estimation Revelation and the Monolithic Query" was described as being worth the cost of attending the conference.  And with myself and +Phil Vacca in attendance, we had a sizable portion of our merry band of less than 10 members present.  Not bad for a user group that's less than a few months old!

Postgres People

As I said at the outset, this was my first Postgres conference.  And I have to add that the people I met were the highlight of the whole affair.  It was a great group, many of whom I'd either never met, or only knew online.  It was readily apparent that the accelerating success of Postgres on many fronts is largely due to the efforts and energy of this community.  It was great to experience it first hand and in action.  It may be stating the obvious, but I'm already looking forward to next year.