Andreas Scherbaum: Wrap-up: MADlib Google Summer of Code

September 9, 2014, 11:38 am

Andreas 'ads' Scherbaum

Google Summer of Code 2014 is wrapped up: Maxence Ahlouche did an excellent job implementing one new algorithm for MADlib and refactored the code base for another one.

I posted a more detailed explanation in the Pivotal blog.

↧

Daniel Pocock: xTupleCon WebRTC talk schedule change, new free event

September 9, 2014, 11:51 am

As mentioned in my earlier blog, I'm visiting several events in the US and Canada in October and November. The first of these, the talk about WebRTC in CRM at xTupleCon, has moved from the previously advertised timeslot to Wednesday, 15 October at 14:15.

WebRTC meeting, Norfolk, VA

Later that day, there will be a WebRTC/JavaScript meetup in Norfolk hosted at the offices of xTuple. It is not part of xTupleCon and free to attend. Please register using the Eventbrite page created by xTuple.

This will be a hands-on event for developers and other IT professionals, especially those in web development, network administration and IP telephony. Please bring laptops and mobile devices with the latest versions of both Firefox and Chrome to experience WebRTC.

Free software developers at xTupleCon

If you do want to attend xTupleCon itself, please contact xTuple directly through this form for details about the promotional tickets for free software developers.

↧

Hans-Juergen Schoenig: Checking per-memory context memory consumption

September 10, 2014, 1:08 am

Writing a complex database server like PostgreSQL is not an easy task. Memory management in particular is an important task that needs special attention. Internally, PostgreSQL makes use of so-called “memory contexts”. The idea of a memory context is to organize memory in groups, which are organized hierarchically. The main advantage is that in case […]

↧

Paul Ramsey: PostGIS 2.1.4 Released

September 9, 2014, 5:00 pm

The 2.1.4 release of PostGIS is now available.

The PostGIS development team is happy to release a patch for PostGIS 2.1, the 2.1.4 release. As befits a patch release, the focus is on bugs, breakages, and performance issues.

http://download.osgeo.org/postgis/source/postgis-2.1.4.tar.gz

↧

Jehan-Guillaume (ioguix) de Rorthais: Bloat estimation for tables

September 10, 2014, 9:30 am

After my Btree bloat estimation query, I found some time to work on a new query for tables. The goal here is still to have a better bloat estimation using dedicated queries for each kind of object.

Compared to the well-known bloat query, this query pays attention to:

TOAST
headers of variable length types
easier to filter or parse

You’ll find the queries here:

from PostgreSQL 7.4 to 8.1: https://gist.github.com/ioguix/f849b1bd31be55da2d7f
from PostgreSQL 8.2 to 8.4: https://gist.github.com/ioguix/74769c8fe5edc582a61b
for PostgreSQL 9.0 and after: https://gist.github.com/ioguix/4f95917f90c9e26df1b2

Tests

I created the file sql/bloat_tables.sql with the query version for 9.0 and later. I edited the query to add the bloat reported by pgstattuple (free_percent + dead_tuple_percent) to compare both results and added the following filter:

-- remove Non Applicable tables
NOT is_na
-- remove tables with real bloat < 1 block
AND tblpages*((pst).free_percent + (pst).dead_tuple_percent)::float4/100 >= 1
-- filter on table name using the parameter :tblname
AND tblname LIKE :'tblname'

Here is the result on a fresh pagila database:

postgres@pagila=# \set tblname %
postgres@pagila=# \i sql/bloat_tables.sql
 current_database | schemaname |    tblname     | real_size | bloat_size | tblpages | is_na |   bloat_ratio    | real_frag
------------------+------------+----------------+-----------+------------+----------+-------+------------------+-----------
 pagila           | pg_catalog | pg_description |    253952 |       8192 |       31 | f     |  3.2258064516129 |      3.34
 pagila           | public     | city           |     40960 |       8192 |        5 | f     |               20 |     20.01
 pagila           | public     | customer       |     73728 |       8192 |        9 | f     | 11.1111111111111 |     11.47
 pagila           | public     | film           |    450560 |       8192 |       55 | f     | 1.81818181818182 |      3.26
 pagila           | public     | rental         |   1228800 |     131072 |      150 | f     | 10.6666666666667 |      0.67
(5 rows)

Well, not too bad. Let’s consider the largest table, clone it and create some bloat:

postgres@pagila=# create table film2 as select * from film;
SELECT 1000
postgres@pagila=# analyze film2;
ANALYZE
postgres@pagila=# \set tblname film%
postgres@pagila=# \i sql/bloat_tables.sql
 current_database | schemaname | tblname | real_size | bloat_size | tblpages | is_na |   bloat_ratio    | real_frag
------------------+------------+---------+-----------+------------+----------+-------+------------------+-----------
 pagila           | public     | film    |    450560 |       8192 |       55 | f     | 1.81818181818182 |      3.26
 pagila           | public     | film2   |    450560 |       8192 |       55 | f     | 1.81818181818182 |      3.26
(2 rows)
postgres@pagila=# delete from film2 where film_id < 250;
DELETE 249
postgres@pagila=# analyze film2;
ANALYZE
postgres@pagila=# \set tblname film2
postgres@pagila=# \i sql/bloat_tables.sql
 current_database | schemaname | tblname | real_size | bloat_size | tblpages | is_na |   bloat_ratio    | real_frag
------------------+------------+---------+-----------+------------+----------+-------+------------------+-----------
 pagila           | public     | film2   |    450560 |     122880 |       55 | f     | 27.2727272727273 |     27.29
(1 row)

Again, the bloat reported here is pretty close to the reality!

Some more tests:

postgres@pagila=# delete from film2 where film_id < 333;
DELETE 83
postgres@pagila=# analyze film2;
ANALYZE
postgres@pagila=# \i sql/bloat_tables.sql
 current_database | schemaname | tblname | real_size | bloat_size | tblpages | is_na |   bloat_ratio    | real_frag
------------------+------------+---------+-----------+------------+----------+-------+------------------+-----------
 pagila           | public     | film2   |    450560 |     155648 |       55 | f     | 34.5454545454545 |     35.08
(1 row)
postgres@pagila=# delete from film2 where film_id < 666;
DELETE 333
postgres@pagila=# analyze film2;
ANALYZE
postgres@pagila=# \i sql/bloat_tables.sql
 current_database | schemaname | tblname | real_size | bloat_size | tblpages | is_na |   bloat_ratio    | real_frag
------------------+------------+---------+-----------+------------+----------+-------+------------------+-----------
 pagila           | public     | film2   |    450560 |     303104 |       55 | f     | 67.2727272727273 |     66.43
(1 row)

Good, good, good. What next?
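The numbers in these test runs can be cross-checked by hand. The following is a minimal Python sketch, not part of the bloat query itself: it assumes that bloat_ratio is simply 100 * bloat_size / real_size (which matches every row shown above) and that the demo filter keeps a table when tblpages times the pgstattuple percentage amounts to at least one block. The function names are made up for illustration, and real_frag is taken as printed, so it is already rounded to two decimals.

```python
BLOCK_SIZE = 8192  # assumed: default PostgreSQL block size


def bloat_ratio(real_size: int, bloat_size: int) -> float:
    """Estimated bloat as a percentage of the table size."""
    return 100.0 * bloat_size / real_size


def passes_filter(tblpages: int, real_frag_percent: float) -> bool:
    """Mirror of the demo filter: keep tables with >= 1 block of real bloat."""
    return tblpages * real_frag_percent / 100.0 >= 1


# film2 after the three successive DELETEs: (bloat_size, real_frag)
film2_runs = [(122880, 27.29), (155648, 35.08), (303104, 66.43)]
real_size = 450560
tblpages = real_size // BLOCK_SIZE  # 55 pages, as reported by the query

for bloat_size, real_frag in film2_runs:
    estimated = bloat_ratio(real_size, bloat_size)
    print(f"estimated {estimated:6.2f}%  measured {real_frag:6.2f}%  "
          f"difference {estimated - real_frag:+.2f}  "
          f"kept by filter: {passes_filter(tblpages, real_frag)}")
```

The differences stay within about one percentage point for film2, which is what "pretty close to the reality" means here.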

The alignment deviation

You might have noticed I did not mention this table with a large deviation between the statistical bloat and the real one, called “rental”:

postgres@pagila=# \set tblname rental
postgres@pagila=# \i sql/bloat_tables.sql
 current_database | schemaname | tblname | real_size | bloat_size | tblpages | is_na |   bloat_ratio    | real_frag
------------------+------------+---------+-----------+------------+----------+-------+------------------+-----------
 pagila           | public     | rental  |   1228800 |     131072 |      150 | f     | 10.6666666666667 |      0.67
(1 row)

This particular situation is exactly why I loved writing these bloat queries (including the btree one), confronting the statistics and the reality and finding a logical answer or a fix.

Statistical and real bloat are actually both right here. The statistical one is just measuring here the bloat AND something else we usually don’t pay attention to. I’ll call it the alignment overhead.

Depending on the field types, PostgreSQL adds some padding before the values to align them inside the row with regard to the CPU word size. This helps ensure a value fits in only one CPU register when possible. Alignment requirements are given in the pg_type page of the PostgreSQL documentation; see the typalign field.

So let’s demonstrate how it influences the bloat here. Back to the rental table, here is its definition:

postgres@pagila=# \d rental
                                          Table "public.rental"
    Column    |            Type             |                         Modifiers
--------------+-----------------------------+------------------------------------------------------------
 rental_id    | integer                     | not null default nextval('rental_rental_id_seq'::regclass)
 rental_date  | timestamp without time zone | not null
 inventory_id | integer                     | not null
 customer_id  | smallint                    | not null
 return_date  | timestamp without time zone |
 staff_id     | smallint                    | not null
 last_update  | timestamp without time zone | not null default now()

All the fields here are fixed-size types, so it is quite easy to compute the row size:

rental_id and inventory_id are 4-byte integers, possible alignment is every 4 bytes from the beginning of the row
customer_id and staff_id are 2-byte integers, possible alignment is every 2 bytes from the beginning of the row
rental_date, return_date and last_update are 8-byte timestamps, possible alignment is every 8 bytes from the beginning of the row

The minimum row size would be 2*4 + 2*2 + 3*8, 36 bytes. Considering the alignment optimization and the order of the fields, we now have (ascii art is easier to explain):

|0     1     2     3     4     5     6     7     8     |
|       rental_id       |***********PADDING************|
|                     rental_date                      |
|     inventory_id      |customer_id|******PADDING*****|
|                     return_date                      |
| staff_id  |*****************PADDING******************|
|                     last_update                      |

That makes 12 bytes of padding and a total row size of 48 bytes instead of 36. That is where the 10% comes from! Let’s double-check this with an experiment:

postgres@pagila=# create table rental2 as select rental_date, return_date, last_update, rental_id, inventory_id, customer_id, staff_id from public.rental;
SELECT 16044
postgres@pagila=# \d rental2
                 Table "public.rental2"
    Column    |            Type             | Modifiers
--------------+-----------------------------+-----------
 rental_date  | timestamp without time zone |
 return_date  | timestamp without time zone |
 last_update  | timestamp without time zone |
 rental_id    | integer                     |
 inventory_id | integer                     |
 customer_id  | smallint                    |
 staff_id     | smallint                    |
postgres@pagila=# \dt+ rental*
                      List of relations
 Schema |  Name   | Type  |  Owner   |  Size   | Description
--------+---------+-------+----------+---------+-------------
 public | rental  | table | postgres | 1200 kB |
 public | rental2 | table | postgres | 1072 kB |
(2 rows)
postgres@pagila=# select 100 * (1200-1072)::float4 / 1200;
     ?column?
------------------
 10.6666666666667
(1 row)

Removing the “remove tables with real bloat < 1 block” filter from my demo query, we have now:

postgres@pagila=# \set tblname rental%
postgres@pagila=# \i sql/bloat_tables.sql
 current_database | schemaname | tblname | real_size | bloat_size | tblpages | is_na |   bloat_ratio    | real_frag
------------------+------------+---------+-----------+------------+----------+-------+------------------+-----------
 pagila           | public     | rental  |   1228800 |     131072 |      150 | f     | 10.6666666666667 |      0.67
 pagila           | public     | rental2 |   1097728 |          0 |      134 | f     |                0 |      0.41
(2 rows)

Great!
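The alignment arithmetic for rental and rental2 can be reproduced with a small Python sketch. It is only a model under stated assumptions, not how the bloat query works. It assumes a typical 64-bit build (8-byte maximum alignment, a 24-byte tuple header once aligned, a 4-byte line pointer per row, a 24-byte page header, 8192-byte blocks) and ignores NULL values entirely. The helper names are made up for illustration.

```python
import math

MAXALIGN = 8        # assumed: typical 64-bit build
TUPLE_HEADER = 24   # assumed: 23-byte tuple header rounded up to MAXALIGN
LINE_POINTER = 4    # assumed: per-row item pointer stored in the page
PAGE_HEADER = 24    # assumed: fixed page header size
BLOCK_SIZE = 8192   # assumed: default block size


def data_size(columns):
    """Return (total_bytes, padding_bytes) of the column data area."""
    offset = padding = 0
    for _name, size, align in columns:
        pad = -offset % align   # bytes needed to reach the next aligned offset
        padding += pad
        offset += pad + size
    return offset, padding


def estimated_pages(columns, row_count):
    """Rough page count for row_count rows, ignoring NULLs and free space."""
    total, _ = data_size(columns)
    tuple_len = TUPLE_HEADER + total
    tuple_len += -tuple_len % MAXALIGN   # the whole tuple is MAXALIGN-ed
    rows_per_page = (BLOCK_SIZE - PAGE_HEADER) // (tuple_len + LINE_POINTER)
    return math.ceil(row_count / rows_per_page)


# (name, size in bytes, alignment in bytes), in table column order
rental = [
    ("rental_id", 4, 4), ("rental_date", 8, 8), ("inventory_id", 4, 4),
    ("customer_id", 2, 2), ("return_date", 8, 8), ("staff_id", 2, 2),
    ("last_update", 8, 8),
]
rental2 = [
    ("rental_date", 8, 8), ("return_date", 8, 8), ("last_update", 8, 8),
    ("rental_id", 4, 4), ("inventory_id", 4, 4),
    ("customer_id", 2, 2), ("staff_id", 2, 2),
]

for name, cols in (("rental", rental), ("rental2", rental2)):
    total, padding = data_size(cols)
    print(name, "data bytes:", total, "padding:", padding,
          "estimated pages:", estimated_pages(cols, 16044))
```

Under these assumptions the sketch gives 12 bytes of padding for rental and none for rental2, and it estimates 150 and 134 pages for the 16044 rows, which lines up with the tblpages values reported above.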

Sadly, I couldn’t find a good way to measure this in the queries so far, so I will live with that. By the way, this alignment overhead might be a nice subject for a script measuring it per table.

Known issues

The same as for the Btree statistical bloat query: I’m pretty sure the query will have a pretty bad estimation with array types. I’ll investigate that later.

Cheers, and happy monitoring!

↧

Chris Travers: Math and SQL Part 6: The Problem with NULLs

September 11, 2014, 6:25 am

This will be the final installment on Math and SQL and will cover the problem with NULLs. NULL handling is probably the most poorly thought-out feature of SQL and is inconsistent generally with the relational model. Worse, a clear mathematical approach to NULLs is impossible with SQL because too many different meanings are attached to the same value.

Unfortunately, nulls are also indispensable because wider tables are more expressive than narrower tables. This makes advice such as "don't allow nulls in your database" somewhat dangerous because one ends up having to add them back in fairly frequently.

At the same time understanding the problems that NULLs introduce is key to avoiding the worst of the problems and managing the rest.

Definition of a Null Set

A null set is simply a set with no members. This brings us to the most obvious case of the use of a NULL, used when an outer join results in a row not being found. This sort of use by itself doesn't do too much harm but the inherent semantic ambiguity of "what does that mean?" also means you can't just substitute join tables for nullable columns and solve the problems that NULLs bring into the database. This will hopefully become more clear below.

Null as Unknown

The first major problem surfaces when we ask the question, "when I do a left join and the row to the right is not found, does that mean we don't know the answer yet or that there is no value associated?" In all databases, a missing result from an outer join will sometimes mean that the answer is not yet known, if only because we are still inserting the data in stages. But it can also mean that the answer is known and that there is no value associated. In almost all databases, both readings are possible for the same missing row.

But then there is no additional harm done in allowing NULLs to represent unknowns in the tables themselves, right?

Handling NULLs as unknown values complicates database design and introduces problems, so many experts like Chris Date tend to be generally against their use. The problem is that using joins doesn't solve the problem but instead only creates additional failure cases to be aware of. So very often, people do use NULL in the database to mean unknown despite the problems.

NULL as unknown introduces problems to predicate logic because it introduces three value logic (true, false, and unknown), but these are typically only problems when one is storing a value (as opposed to a reference such as a key) in the table. 1 + NULL IS NULL. NULL OR FALSE IS NULL. NULL OR TRUE IS TRUE. This makes things complicated. But sometimes we must....
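To make the three identities above concrete, here is a minimal Python sketch that models SQL's three-valued logic with None standing in for NULL. The function names (sql_and, sql_or, sql_not, sql_add) are made up for this illustration; this is a model of the truth tables, not how any database implements them.

```python
def sql_not(a):
    """NOT unknown is still unknown."""
    return None if a is None else not a


def sql_and(a, b):
    """FALSE wins over unknown; otherwise unknown is contagious."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def sql_or(a, b):
    """TRUE wins over unknown; otherwise unknown is contagious."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def sql_add(a, b):
    """Arithmetic with an unknown operand yields unknown."""
    return None if a is None or b is None else a + b


print(sql_add(1, None))     # None: 1 + NULL IS NULL
print(sql_or(None, False))  # None: NULL OR FALSE IS NULL
print(sql_or(None, True))   # True: NULL OR TRUE IS TRUE
```

The asymmetry is the whole point: an unknown operand only matters when the other operand cannot decide the result on its own.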

Null as Not Applicable

One severe antipattern that is frequently seen is the use of NULL to mean "Not Applicable" or "No Value." There are a few data types which have no natural empty/no-op types. Prime among these are numeric types. Worse, Oracle treats NULL as the same value as an empty string for VARCHAR types.

Now, the obvious problem here is that the database doesn't know here that NULL is not unknown, and therefore you end up having to track this yourself, use COALESCE() functions to convert to sane values, etc. In general, if you can avoid using NULL to mean "Not Applicable" you will find that worthwhile.

Now, if you have to do this, one strategy to make this manageable is to include other fields to tell you what the null means. Consider for example:

CREATE TABLE wage_class (
id int primary key, -- the foreign key in wage needs a unique target
label text not null
);

INSERT INTO wage_class VALUES(1, 'salary'), (2, 'hourly');

CREATE TABLE wage (
ssn text not null,
emp_id int not null,
wage_class int not null references wage_class(id),
hourly_wage numeric,
salary numeric,
check (wage_class = 1 or salary is null),
check (wage_class = 2 or hourly_wage is null)
);

This approach allows us to select and handle logic based on the wage class and therefore we know based on the wage_class field whether hourly_wage is applicable or not. This is far cleaner and allows for better handling in queries than just putting nulls in and expecting them to be semantically meaningful. This solution can also be quite helpful because it ensures that one does not accidentally process an hourly wage as a salary or vice versa.
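As a quick way to see the two CHECK constraints at work without a PostgreSQL server, here is a small Python sketch that runs an equivalent schema in an in-memory SQLite database. SQLite is only a stand-in here and differs from PostgreSQL in many respects. In this sketch wage_class.id is declared as a primary key so that the reference has a unique target, and the sample rows and the try_insert helper are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wage_class (
    id int primary key,
    label text not null
);
INSERT INTO wage_class VALUES (1, 'salary');
INSERT INTO wage_class VALUES (2, 'hourly');

CREATE TABLE wage (
    ssn text not null,
    emp_id int not null,
    wage_class int not null references wage_class(id),
    hourly_wage numeric,
    salary numeric,
    check (wage_class = 1 or salary is null),
    check (wage_class = 2 or hourly_wage is null)
);
""")


def try_insert(row):
    """Return True if the row satisfies the constraints, False otherwise."""
    try:
        conn.execute("INSERT INTO wage VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


# Invented sample rows: (ssn, emp_id, wage_class, hourly_wage, salary)
print(try_insert(("000-00-0001", 1, 1, None, 60000)))  # salaried, accepted
print(try_insert(("000-00-0002", 2, 2, 25, None)))     # hourly, accepted
print(try_insert(("000-00-0003", 3, 2, None, 60000)))  # hourly with salary, rejected
print(try_insert(("000-00-0004", 4, 1, 25, None)))     # salaried with hourly wage, rejected
```

Because both checks only look at columns of the same row, the database can enforce the mutual exclusion without consulting any other table.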

What Nulls Do to Predicate Logic

Because NULLs can represent unknowns, they introduce three-valued predicate logic. This itself can be pretty nasty. Consider the very subtle difference between:

WHERE ssn like '1234%' AND salary < 50000

vs

WHERE ssn like '1234%' AND (salary < 50000) IS NOT FALSE

The latter will pull in hourly employees as well, as they have a NULL salary.
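The difference between the two predicates can be simulated in a few lines of Python, with None standing in for NULL. This is only a model of the WHERE semantics: the rows are invented, all of them are assumed to match the ssn pattern, and less_than and where are hypothetical helpers.

```python
# Invented rows; salary is None (NULL) for the hourly employee.
employees = [
    {"ssn": "1234-A", "salary": 40000},
    {"ssn": "1234-B", "salary": 90000},
    {"ssn": "1234-C", "salary": None},
]


def less_than(value, limit):
    """SQL-style comparison: a NULL input gives unknown (None)."""
    return None if value is None else value < limit


def where(rows, predicate):
    """A WHERE clause keeps only rows whose predicate is exactly true."""
    return [row["ssn"] for row in rows if predicate(row) is True]


# salary < 50000
plain = where(employees, lambda r: less_than(r["salary"], 50000))

# (salary < 50000) IS NOT FALSE: unknown counts as "not false"
not_false = where(
    employees, lambda r: less_than(r["salary"], 50000) is not False
)

print(plain)      # ['1234-A']
print(not_false)  # ['1234-A', '1234-C']
```

The second form turns "unknown" into "true", which is exactly how the NULL-salary row slips into the result.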

Nulls and Constraints

Despite all the problems, NULLs have become a bit of a necessary evil. Constraints are a big part of the reason why.

Constraints are far simpler to maintain if they are self-contained in a tuple and therefore require no further table access to verify. This means that wider tables admit to more expression relating to constraints than narrow tables.

In the example above, we can ensure that every hourly employee has no salary, and every salaried employee has no hourly wage. This level of mutual exclusion would not be possible if we were to break off salaries and wages into separate, joined tables.

Nulls and Foreign Keys

Foreign keys are a special case of NULLs where the use is routine and poses no problems. NULL always means "no record referenced" in this context and because of the specifics of three-valued boolean logic, they always drop out of join conditions.

NULLs in foreign keys make foreign key constraints and 5th Normal Form possible in many cases where it would not be otherwise. Consequently they can be used routinely here with few if any ill effects.

What Nulls Should Have Looked Like: NULL, NOVALUE, UNKNOWN

In retrospect, SQL would be cleaner if we could be more verbose about what we mean by a NULL. UNKNOWN could then be reserved for rare cases where we really need to store a record with incomplete data in it. NULL could be returned from outer joins, and NOVALUE could be used for foreign keys and places where we know the field is not applicable.

↧

Keith Fiske: A Large Database Does Not Mean Large shared_buffers

September 11, 2014, 11:53 am

A co-worker of mine did a blog post last year that I’ve found incredibly useful when assisting clients with getting shared_buffers tuned accurately.

Setting shared_buffers the hard way

You can follow his queries there for using pg_buffercache to find out how your shared_buffers are actually being used. But I had an incident recently that I thought would be interesting to share that shows how shared_buffers may not need to be set nearly as high as you believe it should. Or it can equally show you that you definitely need to increase it. Object names have been sanitized to protect the innocent.

To set the stage, the database total size is roughly 260GB and the use case is high data ingestion with some reporting done on just the most recent data at the time. shared_buffers is set to 8GB. The other thing to note is that this is the only database in the cluster. pg_buffercache is installed on a per database basis, so you’ll have to install it on each database in the cluster and do some additional totalling to figure out your optimal setting in the end.

database=# SELECT c.relname
  , pg_size_pretty(count(*) * 8192) as buffered
  , round(100.0 * count(*) / ( SELECT setting FROM pg_settings WHERE name='shared_buffers')::integer,1) AS buffers_percent
  , round(100.0 * count(*) * 8192 / pg_relation_size(c.oid),1) AS percent_of_relation
 FROM pg_class c
 INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
 INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
 GROUP BY c.oid, c.relname
 ORDER BY 3 DESC
 LIMIT 10;
               relname               | buffered | buffers_percent | percent_of_relation
-------------------------------------+----------+-----------------+---------------------
 table1                              | 7479 MB  |            91.3 |                 9.3
 table2                              | 362 MB   |             4.4 |               100.0
 table3                              | 311 MB   |             3.8 |                 0.8
 table4                              | 21 MB    |             0.3 |               100.0
 pg_attrdef_adrelid_adnum_index      | 16 kB    |             0.0 |               100.0
 table4                              | 152 kB   |             0.0 |                 7.7
 index5                              | 16 kB    |             0.0 |                14.3
 pg_index_indrelid_index             | 40 kB    |             0.0 |                 8.8
 pg_depend_depender_index            | 56 kB    |             0.0 |                 1.0
 pg_cast_source_target_index         | 16 kB    |             0.0 |               100.0

You can see that table1 is taking up a vast majority of the space here and it’s a large table, so only 9% of it is actually in shared_buffers. What’s more interesting though is how much of the space for that table is actually in high demand.
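The buffers_percent column can be sanity-checked outside the database. The Python sketch below assumes 8192-byte buffers (as hard-coded in the query) and shared_buffers of exactly 8GB. It starts from the rounded sizes printed by pg_size_pretty, so the results are approximate, and the helper names are invented.

```python
BLOCK_SIZE = 8192  # bytes per buffer, as in the query above
MB = 1024 ** 2
GB = 1024 ** 3

shared_buffers = 8 * GB // BLOCK_SIZE  # number of buffers in an 8GB cache


def buffers_percent(buffered_mb: int) -> float:
    """Share of shared_buffers used by a relation holding buffered_mb MB."""
    buffered = buffered_mb * MB // BLOCK_SIZE
    return round(100.0 * buffered / shared_buffers, 1)


for relname, buffered_mb in (("table1", 7479), ("table2", 362),
                             ("table3", 311), ("table4", 21)):
    print(f"{relname}: {buffers_percent(buffered_mb)}% of shared_buffers")
```

The computed shares (91.3, 4.4, 3.8 and 0.3) match the query output, which confirms that the percentage is relative to the whole 8GB cache and not to what is currently in use.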

database=# SELECT pg_size_pretty(count(*) * 8192) 
FROM pg_class c
INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
WHERE c.oid::regclass = 'table1'::regclass
AND usagecount >= 2;
  pg_size_pretty
----------------------
 2016 kB

Data blocks that go into and come out of postgres all go through shared_buffers. Just to review the blog post I linked to, whenever a block is used in shared memory, it increments a usage count, used by the clock-sweep algorithm, that ranges from 1-5, 5 being extremely high use data blocks. This means high usage blocks are likely to be kept in shared_buffers (if there’s room) and low usage blocks will get moved out if space for higher usage ones is needed. We believe that a simple insert or update sets a usagecount of 1. So, now we look at the difference when usage count is dropped to that.

database=# SELECT pg_size_pretty(count(*) * 8192) 
FROM pg_class c
INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
WHERE c.oid::regclass = 'table1'::regclass
AND usagecount >= 1;
 pg_size_pretty
----------------------
 4946 MB

So the shared_buffers is actually getting filled mostly by the data ingestion process, but relatively very little of it is of any further use afterwards. If anything of greater importance was needed in shared_buffers, there’s plenty of higher priority space and that inserted data would quickly get flushed out of shared memory due to having a low usagecount.
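To illustrate why low-usagecount pages are the first to go, here is a heavily simplified Python model of a clock-sweep buffer pool. It is a sketch under assumptions and not PostgreSQL's actual buffer manager: there are no pins, dirty pages or ring buffers, the usage count is capped at 5, and a newly loaded page starts at 1. The BufferPool class and the page names are invented.

```python
MAX_USAGE = 5  # assumed cap on the usage count


class BufferPool:
    def __init__(self, size):
        self.slots = [None] * size  # page held by each buffer
        self.usage = [0] * size     # usage count per buffer
        self.hand = 0               # clock hand position

    def access(self, page):
        if page in self.slots:      # hit: bump the usage count
            i = self.slots.index(page)
            self.usage[i] = min(self.usage[i] + 1, MAX_USAGE)
            return
        i = self._find_victim()     # miss: evict something
        self.slots[i] = page
        self.usage[i] = 1           # a freshly loaded page starts at 1

    def _find_victim(self):
        """Sweep the clock hand, decrementing counts until one reaches 0."""
        while True:
            i = self.hand
            self.hand = (self.hand + 1) % len(self.slots)
            if self.slots[i] is None or self.usage[i] == 0:
                return i
            self.usage[i] -= 1


pool = BufferPool(size=3)
for _ in range(4):
    pool.access("hot")              # a page in steady demand

for n in range(10):
    pool.access(f"ingest-{n}")      # pages touched once by ingestion
    pool.access("hot")              # the hot page keeps being read

print(pool.slots)
print(pool.usage)
```

In this run the hot page is never evicted, while the ingestion pages keep pushing each other out: the same pattern the usagecount numbers above suggest for table1.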

So with pg_buffercache installed, we’ve found that the query below seems to give a good estimate of an optimal, minimum shared_buffers setting.

database=# SELECT pg_size_pretty(count(*) * 8192) as ideal_shared_buffers
 FROM pg_class c
 INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
 INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
 WHERE usagecount >= 3;
 ideal_shared_buffers
----------------------
 640 MB

This is the sort of query you would run after you have had your database running through your expected workload for a while. Also, note my use of the keyword minimum. This does not account for unexpected spikes in shared_buffers usage that may occur during a session of reporting queries or something like that. So you definitely want to set it higher than this, but it can at least show you how effectively postgres is using its shared memory. In general we’ve found the typical suggestion of 8GB to be a great starting point for shared_buffers.

So, in the end, the purpose of this post was to show that shared_buffers is something that needs further investigation to really set optimally and there is a pretty easy method to figuring it out once you know where to look.

↧

Leo Hsu and Regina Obe: FOSS4G 2014 televised live

September 11, 2014, 1:37 pm

If you weren't able to make it to FOSS4G 2014 this year, you can still experience the event live. All the tracks are being televised live and the reception is pretty good: https://2014.foss4g.org/live/. Lots of GIS users are using PostGIS and PostgreSQL. People seem to love Node.JS too.

After hearing enough about Node.JS from all these people, and this guy (Bill Dollins), I decided to try this out for myself.

I created a node.js web application - which you can download from here: https://github.com/robe2/node_postgis_express . It's really a spin-off from my other viewers, but more raw. I borrowed the same ideas as Bill, but instead of having a native node Postgres driver, I went for the pure javascript one so it's easier to install on all platforms. I also experimented with using base-64 encoding to embed raster output directly into the browser so I don't have to have that silly img src path reference thing to contend with.
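The base-64 embedding idea can be sketched in a few lines. The application above is written for Node.js; the Python below is only an illustration of the general data URI technique, with a made-up helper name and the 8-byte PNG signature standing in for real raster output.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URI usable as an img src value."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder payload: just the PNG file signature, not a full image.
png_signature = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(png_signature)
print(uri)
```

The resulting string goes straight into the src attribute of an img element, so the page needs no separate image URL or temporary file.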

↧

gabrielle roth: PDXPUGDay 2014 report

September 11, 2014, 1:55 pm

We had about 50 folks attend the PDXPUGDay 2014 last week, between DjangoCon and Foss4g. A lot of folks were already in town for one of the other confs, but several folks also day tripped from SeaPUG! Thanks for coming on down.

Thanks again to our speakers:
Josh Drake
David Wheeler
Eric Hanson
Veronika Megler
Kristin Tufte
Josh Berkus

(Plus our lightning talk speakers: Josh B, Mark W, and Basil!)

And our sponsors:
2nd Quadrant
iovation
PGX

And of course, PSU for hosting us.

Videos are linked from the wiki.

↧

Craig Ringer: pg_sysdatetime: a simple cross-platform PostgreSQL extension

September 12, 2014, 2:03 am

A while ago I wrote about compiling PostgreSQL extensions under Visual Studio– without having to recompile the whole PostgreSQL source tree.

I just finished the pg_sysdatetime extension, which is mainly for Windows but also supports compilation with PGXS on *nix. It’s small enough that it serves as a useful example of how to support Windows compilation in your extension, so it’s something I think is worth sharing with the community.

The actual Visual Studio project creation process took about twenty minutes, and would’ve taken less if I wasn’t working remotely over Remote Desktop on an AWS EC2 instance. Most of the time was taken by the simple but fiddly and annoying process of adding the include paths and library path for the x86 and x64 configurations. That’s necessary because MSVC can’t just get them from pg_config and doesn’t seem to have user-defined project variables to let you specify a $(PGINSTALLDIR) in one place.

Working on Windows isn’t always fun – but it’s not as hard as it’s often made out to be either. If you maintain an extension but haven’t added Windows support it might be easier than you expect to do so.

Packaging it for x86 and x64 versions of each major PostgreSQL release, on the other hand… well, let’s just say we could still use PGXS support for Windows with a “make installer” target.

↧

Barry Jones: Video: SQL vs NoSQL Discussion at UpstatePHP

September 12, 2014, 9:12 pm

Here's the video from the August UpstatePHP meeting in Greenville discussing SQL vs NoSQL and where they are useful for your development process. I represented SQL solutions (*cough* PostgreSQL *cough*) while Benjamin Young represented NoSQL. Ben has actively contributed to CouchDB, worked for Cloudant, Couchbase, organizes the REST Fest Unconference (happening again September 25-27th) and is t...

↧

Michael Paquier: Postgres 9.5 feature highlight: Logging of replication commands

September 14, 2014, 7:24 am

Postgres 9.5 will come with an additional logging option that makes it possible to log replication commands received by a node. It was introduced by this commit.

commit: 4ad2a548050fdde07fed93e6c60a4d0a7eba0622
author: Fujii Masao <fujii@postgresql.org>
date: Sat, 13 Sep 2014 02:55:45 +0900
Add GUC to enable logging of replication commands.

Previously replication commands like IDENTIFY_COMMAND were not logged
even when log_statements is set to all. Some users who want to audit
all types of statements were not satisfied with this situation. To
address the problem, this commit adds new GUC log_replication_commands.
If it's enabled, all replication commands are logged in the server log.

There are many ways to allow us to enable that logging. For example,
we can extend log_statement so that replication commands are logged
when it's set to all. But per discussion in the community, we reached
the consensus to add separate GUC for that.

Reviewed by Ian Barwick, Robert Haas and Heikki Linnakangas.

The new parameter is called log_replication_commands and needs to be set in postgresql.conf. Default is off to not log this new information that may surprise existing users after an upgrade to 9.5 and newer versions. And actually replication commands received by a node were already logged at DEBUG1 level by the server. A last thing to note is that if log_replication_commands is enabled, all the commands will be printed as LOG and not as DEBUG1, which is kept for backward-compatibility purposes.

Now, a server enabling this logging mode...

$ psql -At -c 'show log_replication_commands'
on

... is able to show replication commands in LOG mode. Here is, for example, the set of commands sent by a standby starting up:

LOG:  received replication command: IDENTIFY_SYSTEM
LOG:  received replication command: START_REPLICATION 0/3000000 TIMELINE 1

This will certainly help utilities and users running audits of replication, so I am looking forward to seeing log parsing tools like pgbadger produce some nice output using this information.
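As an idea of what such a tool could do with these lines, here is a minimal Python sketch that extracts and counts replication commands. It assumes the message text shown above and tolerates an arbitrary log_line_prefix in front of it. The pattern and function name are made up for this example and are not taken from pgbadger.

```python
import re
from collections import Counter

# Matches the message text shown above, wherever it appears in the line.
PATTERN = re.compile(r"LOG:\s+received replication command:\s+(\S+)\s*(.*)$")


def replication_commands(lines):
    """Yield (command, arguments) for every logged replication command."""
    for line in lines:
        match = PATTERN.search(line)
        if match:
            yield match.group(1), match.group(2)


sample = [
    "LOG:  received replication command: IDENTIFY_SYSTEM",
    "LOG:  received replication command: START_REPLICATION 0/3000000 TIMELINE 1",
    "LOG:  database system is ready to accept connections",
]

commands = list(replication_commands(sample))
counts = Counter(command for command, _args in commands)
print(commands)
print(counts)
```

Counting per command is only the simplest possible report; a real audit would also want the timestamp and client taken from log_line_prefix.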

↧

Hubert 'depesz' Lubaczewski: Waiting for 9.5 – Add width_bucket(anyelement, anyarray).

September 14, 2014, 1:33 pm

On 9th of September, Tom Lane committed patch: Add width_bucket(anyelement, anyarray). This provides a convenient method of classifying input values into buckets that are not necessarily equal-width. It works on any sortable data type. The choice of function name is a bit debatable, perhaps, but showing that there's a relationship to the SQL […]
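For readers who want a feel for what classifying values into non-equal-width buckets looks like, here is a small Python sketch. It assumes the array holds the sorted lower bounds of the buckets and that values below the first bound fall into bucket 0. That is my reading of the function's behaviour and not something stated in the excerpt above. It uses Python's bisect module and is not the PostgreSQL implementation.

```python
from bisect import bisect_right


def width_bucket(operand, thresholds):
    """Return the bucket number for operand.

    thresholds must be a sorted list of bucket lower bounds. Values below
    the first bound land in bucket 0, values at or above the last bound
    land in the last bucket.
    """
    return bisect_right(thresholds, operand)


bounds = [0, 10, 100, 1000]  # buckets of very different widths
for value in (-5, 0, 7, 10, 250, 5000):
    print(value, "->", width_bucket(value, bounds))
```

Since the lookup only relies on ordering, the same approach works for any sortable type, such as dates or strings.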

↧

Chris Travers: LedgerSMB 1.4.0 Released

September 14, 2014, 4:36 pm

≫ Next: Abdul Yadi: Delta Table Clean Up in Bucardo 5 Cascaded Slave Replication

≪ Previous: Hubert 'depesz' Lubaczewski: Waiting for 9.5 – Add width_bucket(anyelement, anyarray).

15 September 2014, London. The LedgerSMB project - all-volunteer developers and contributors - today announced LedgerSMB 1.4.0.

Based on an open source code base first released in 1999, the LedgerSMB project was formed in 2006 and saw its 1.0 release in the same year. It has now seen continuous development for over eight years, and that shows no signs of slowing down.

"LedgerSMB 1.4 brings major improvements that many businesses need," said Chris Travers, who helped found the project. "Businesses which do manufacturing or retail, or need features like funds accounting will certainly get much more out of this new release."

Better Productivity

LedgerSMB 1.4 features a redesigned contact management framework that allows businesses to better keep track of customers, vendors, employers, sales leads, and more. Contacts can be stored and categorized, and leads can be converted into sales accounts.

Additionally, a new import module has been included that allows businesses to upload csv text files to import financial transactions and much more. No longer is data entry something that needs to be done entirely by hand or involves customizing the software.

Many smaller enhancements are here as well. For example, shipping labels can now be printed for invoices and orders, and user management workflows have been improved.

Better Reporting

The reporting interfaces have been rewritten in LedgerSMB 1.4.0 in order to provide greater flexibility in both reporting and in sharing reports. Almost all reports now include a variety of formatting options including PDF and CSV formats. Reports can also be easily shared within an organization using stable hyperlinks to reports. Additionally the inclusion of a reporting engine means that it is now relatively simple to write third-party reports which offer all these features. Such reports can easily integrate with LedgerSMB or be accessed via a third party web page.

Additionally, the new reporting units system provides a great deal more flexibility in tracking money and resources as they travel through the system. Not only can one track by project or department, but funds accounting and other specialized reporting needs can also be met.

Better Integration

Integration of third-party line of business applications is also something which continues to improve. While all integration is possible, owing to the open nature of the code and db structure, it has become easier as more logic is moved to where it can be easily discovered by applications.

There are two major improvement areas in 1.4. First, additional critical information, particularly regarding manufacturing and cost of goods sold tracking, has been moved into the database where it can be easily shared by other applications. This also allows for better testability and support. Second, LedgerSMB now offers a framework for web services, which are currently available for contact management purposes, allowing integrators to more easily connect programs together.

Commercial Options

LedgerSMB isn't just an open source project. A number of commercial companies offer support, hosting, and customization services for this ERP. A list of some of the most prominent commercial companies involved can be found at http://ledgersmb.org/topic/commercial-support

↧

Abdul Yadi: Delta Table Clean Up in Bucardo 5 Cascaded Slave Replication

September 14, 2014, 6:49 pm

≫ Next: US PostgreSQL Association: PgUS Fall Update 2014

≪ Previous: Chris Travers: LedgerSMB 1.4.0 Released

Thanks to the Bucardo team for responding to my previous post. My cascaded slave replication works as expected.

Today I noticed that there is still something to do regarding the delta and track tables.
Single table replication scenario:
Db-A/Tbl-T1 (master) => Db-B/Tbl-T2 (slave) => Db-C/Tbl-T3 (cascaded slave)

Every change on table T1 is replicated to T2, then from T2 to T3. After a while, VAC successfully cleans the delta and track tables on Db-A, but not on Db-B.

I found 2 issues:
1. If the cascaded replication from T2 to T3 succeeds, the delta table on Db-B is not cleaned up by VAC.
2. If the cascaded replication from T2 to T3 fails before the scheduled VAC run, the delta table on Db-B is cleaned up by VAC anyway. The changes that should have gone from T2 to T3 are then lost.

I fixed it by modifying the SQL inside bucardo.bucardo_purge_delta(text, text):

-- Delete all txntimes from the delta table that:
-- 1) Have been used by all dbgroups listed in bucardo_delta_targets
-- 2) Have a matching txntime from the track table
-- 3) Are older than the first argument interval

  myst = 'DELETE FROM bucardo.'
  || deltatable
  || ' USING (SELECT track.txntime AS tt FROM bucardo.'
  || tracktable
  || ' track INNER JOIN bucardo.bucardo_delta_targets bdt ON track.target=bdt.target'
  || ' GROUP BY 1 HAVING COUNT(*) = '
  || drows
  || ') AS foo'
  || ' WHERE txntime = tt'
  || ' AND txntime < now() - interval '
  || quote_literal($1);
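
To make the intent of the modified condition easier to discuss, here is a small in-memory model of it in Python. It is only a sketch under assumptions: drows is taken to be the number of targets registered in bucardo_delta_targets for the table, and the function, variable and target names are invented for the illustration. It is not Bucardo code.

```python
from collections import Counter
from datetime import datetime, timedelta


def purgeable_txntimes(delta, track, targets, max_age, now):
    """Return delta txntimes tracked for every registered target and older than max_age."""
    # Assumption: drows equals the number of registered targets.
    drows = len(targets)
    # Mirrors the INNER JOIN plus GROUP BY: count track rows per txntime,
    # ignoring rows whose target is not registered.
    tracked = Counter(txntime for txntime, target in track if target in targets)
    cutoff = now - max_age
    return sorted(
        t for t in set(delta)
        if t in tracked and tracked[t] == drows and t < cutoff
    )


now = datetime(2014, 9, 14, 12, 0)
t1 = now - timedelta(hours=2)     # tracked by both targets, old enough
t2 = now - timedelta(minutes=90)  # tracked by only one target
t3 = now - timedelta(minutes=1)   # tracked by both targets, too recent

delta = [t1, t2, t3]
track = [(t1, "B"), (t1, "C"), (t2, "B"), (t3, "B"), (t3, "C")]
targets = {"B", "C"}  # hypothetical target names

print(purgeable_txntimes(delta, track, targets, timedelta(hours=1), now))
```

In this model only t1 is purged: t2 is kept because one target has not yet received it, and t3 is kept because it is newer than the interval.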

I would appreciate advice from the Bucardo team.

↧

US PostgreSQL Association: PgUS Fall Update 2014

September 15, 2014, 8:44 am

≫ Next: Joshua Drake: GCE, A little advertised cloud service that is perfect for PostgreSQL

≪ Previous: Abdul Yadi: Delta Table Clean Up in Bucardo 5 Cascaded Slave Replication

It has been a little quiet on the U.S. front of late. Alas, the summer of 2014 has come and gone, and it is time to strap on the gaiters and get a little muddy. Although we have been relatively quiet, we have been doing some work. In 2013 the board appointed two new board members, Jonathan S. Katz and Jim Mlodgenski. We also affiliated with multiple PostgreSQL User Groups:

NYCPUG

PhillyPUG

SeaPUG

PDXPUG

↧

Joshua Drake: GCE, A little advertised cloud service that is perfect for PostgreSQL

September 15, 2014, 9:48 am

≫ Next: gabrielle roth: PDXPUGDay Recap

≪ Previous: US PostgreSQL Association: PgUS Fall Update 2014

Maybe...

I have yet to run PostgreSQL on GCE in production. I am still testing it but I have learned the following:

A standard provisioned disk for GCE will give you ~ 80MB/s random write.
A standard SSD provisioned disk for GCE will give you ~ 240MB/s.

Either disk can be provisioned as a raw device, allowing you to use Linux software RAID to build a RAID 10, which further increases speed and reliability. Think about that: 4 SSD provisioned disks in a RAID 10...

The downside I see, outside of the general arguments against cloud services (shared tenancy, all your data in a big brother, lack of control over your resources, general distaste for $vendor, or whatever else we in our right minds can think up), is that GCE is currently limited to 16 virtual CPUs and 104GB of memory.

What does that mean? Well, it means that GCE is likely perfect for 99% of PostgreSQL workloads. By far the majority of PostgreSQL deployments need less than 104GB of memory. Granted, we have customers that have 256GB, 512GB and even more, but those are few and far between.

It also means that EC2 is no longer your only choice for dynamic cloud provisioned VMs for PostgreSQL. Give it a shot; the more competition in this space, the better.

↧

gabrielle roth: PDXPUGDay Recap

September 15, 2014, 5:49 pm

≫ Next: Chris Travers: PGObject Cookbook Part 1: Introduction

≪ Previous: Joshua Drake: GCE, A little advertised cloud service that is perfect for PostgreSQL

Last weekend we held the biggest PDXPUGDay we’ve had in a while! 5 speakers + a few lightning talks added up to a fun lineup. About 1/3 of the ~50 attendees were in town for FOSS4G; I think the guy from New Zealand will be holding the “visitor farthest from PDXPUG” for a good long […]

↧

Chris Travers: PGObject Cookbook Part 1: Introduction

September 15, 2014, 8:21 pm

≫ Next: Pavel Stehule: nice unix filter pv

≪ Previous: gabrielle roth: PDXPUGDay Recap

Preface

I have decided to put together a PGObject Cookbook, showing the power of this framework. If anyone is interested in porting the db-looking sides to other languages, please let me know. I would be glad to provide whatever help my time and skills allow.

The PGObject framework is a framework for integrating intelligent PostgreSQL databases into Perl applications. It addresses some of the same problems as ORMs but does so in a very different way. Some modules are almost ORM-like and more such modules are likely to be added in the future. However, unlike an ORM, PGObject mostly serves as an interface to stored procedures, and whatever code generation routines are added are not intended to be quickly changed. Moreover, it only supports PostgreSQL because we make extended use of PostgreSQL-only features.

For those who are clearly not interested in Perl, this series may still be interesting, as it covers not only how to use the framework but also various problems that happen when we integrate databases with applications. And there are people who should not use this framework because it is not the right tool for the job. For example, if you are writing an application that must support many different database systems, you will probably get more out of an ORM than out of this framework. But you may still get some interesting stuff from this series, so feel free to enjoy it.

Along the way this will explore a lot of common problems that happen when writing database-centric applications and how these can be solved using the PGObject framework. Other solutions of course exist and hopefully we can talk about these in the comments.

Much of the content here (outside of the prefaces) will go into a documentation module on CPAN. However I expect it to also be of far more general interest since the problems are common problems across frameworks.

Introduction

PGObject is written under the theory that the database will be built as a server of information and only loosely tied to the application. Therefore stored procedures should be able to add additional parameters without expecting that the application knows what to put there, so if the parameter can accept a null and provide the same answer as before, the application can be assured that the database is still usable.
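
The idea can be sketched in a few lines. The Python below is not PGObject's API (the framework is Perl, and every name here, including the customer_save procedure, is hypothetical). It only shows the principle: the argument list comes from the database, the application supplies what it knows, and anything it does not know about is passed as NULL.

```python
def build_call(funcname, db_arg_names, properties):
    """Build a parameterized call, filling unknown arguments with None (SQL NULL)."""
    # funcname is assumed to be a trusted identifier, not user input.
    placeholders = ", ".join(["%s"] * len(db_arg_names))
    sql = f"SELECT * FROM {funcname}({placeholders})"
    # dict.get returns None for arguments the application has never heard of.
    params = [properties.get(name) for name in db_arg_names]
    return sql, params


# Argument names as the database might report them (hypothetical procedure).
old_signature = ["id", "name"]
new_signature = ["id", "name", "credit_limit"]  # parameter added later in the db

customer = {"id": 42, "name": "Acme"}
print(build_call("customer_save", old_signature, customer))
print(build_call("customer_save", new_signature, customer))
```

The application code is identical for both signatures; only the argument list reported by the database changes, which is the loose coupling described above.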

The framework also includes a fairly large number of other capabilities. As we work through we will go through the main areas of functionality one at a time, building on the simplest capabilities and moving onto the more advanced. In general these capabilities can be grouped into basic, intermediate, and advanced:

Basic Functionality

Registered types, autoserialization, and autodeserialization
The simple stored procedure mapper
Aggregates and ordering
Declarative mapped methods

Intermediate Functionality

The Bulk Loader
The Composite Type stored procedure mapper
The database admin functions

Advanced Functionality

Memoization of Catalog Lookups
Writing your own stored procedure mapper

This series will cover all the above functionality and likely more. As we get through the series, I hope that it will start to make sense and we will start to get a lot more discussion (and hopefully use) surrounding the framework.

Design Principles

The PGObject framework came out of a few years of experience building and maintaining LedgerSMB 1.3. In general we took what we liked and what seemed to work well and rewrote those things that didn't. Our overall approach has been based on the following principles:

SQL-centric: Declarative, hand-coded SQL is usually more productive than application programming languages. The system should leverage hand-coded SQL.
Leveraging Stored Procedures and Query Generators: The system should avoid having people generate SQL queries themselves as strings and executing them. It's better to store them persistently in the db or generate well-understood queries in general ways where necessary.
Flexible and Robust: It should be possible to extend a stored procedure's functionality (and arguments) without breaking existing applications.
DB-centric but Loosely Coupled: The framework assumes that databases are the center of the environment, and that it is a self-contained service in its own right. Applications need not be broken because the db structure changed, and the DB should be able to tell the application what inputs it expects.
Don't Make Unnecessary Decisions for the Developer: Applications may use a framework in many atypical ways and we should support them. This means that very often instead of assuming a single database connection, we instead provide hooks in the framework so the developer can decide how to approach this. Consequently you can expect your application to have to slightly extend the framework to configure it.

This framework is likely to be very different from anything else you have used. While it shares some similarities with iBatis in the Java world, it is unique in the sense that the SQL is stored in the database, not in config files. And while it was originally inspired by a number of technologies (including both REST and SOAP/WSDL), it is very much unlike any other framework I have come across.

Next in Series: Registered Types: Autoserialization and Deserialization between Numeric and Math::BigFloat.

↧

Pavel Stehule: nice unix filter pv

September 16, 2014, 8:14 am

≫ Next: Vasilis Ventirozos: Offsite replication problems and how to solve them.

≪ Previous: Chris Travers: PGObject Cookbook Part 1: Introduction

I was looking for a filter that can count the rows being processed and show progress. It exists, and it is pv:

# import to vertica
zcat data.sql | pv -s 16986105538 -p -t -r | vsql

ALTER TABLE
0:13:56 [4.22MB/s] [==============>                                                                                               ] 14%

More: http://linux.die.net/man/1/pv
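
For readers curious what such a filter does in principle, here is a minimal Python sketch: data is passed through unchanged while bytes are counted against an expected total (the value given to pv with -s). It is an illustration only, not how pv is implemented, and the function name is invented.

```python
import io


def copy_with_progress(src, dst, expected_size, chunk_size=65536, report=None):
    """Copy src to dst unchanged, reporting percent of expected_size after each chunk."""
    done = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        done += len(chunk)
        if report is not None:
            # Integer percentage of the expected total seen so far.
            report(100 * done // expected_size)
    return done


source = io.BytesIO(b"x" * 1000)
sink = io.BytesIO()
seen = []
total = copy_with_progress(source, sink, expected_size=1000,
                           chunk_size=250, report=seen.append)
print(total, seen)
```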

↧