
Leo Hsu and Regina Obe: The wonders of Any Element


PostgreSQL has an interesting placeholder type called anyelement, which it has had for a long time, along with its complement anyarray. They are used when you want to define a function that can accept arguments of many types or return many types of output. They are particularly useful for defining aggregates, which we demonstrated in Who's on First and Who's on Last and several other aggregate articles.

Anyelement / anyarray can be used just as conveniently in regular functions. The main gotcha is that once the first anyelement / anyarray argument is passed in, all subsequent anyelement / anyarray arguments must resolve to the same data type as the first.
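For illustration, here is a minimal sketch of a polymorphic function (the function name and sample values are ours, not from the article):

CREATE OR REPLACE FUNCTION first_non_null(anyelement, anyelement)
RETURNS anyelement AS $$
    SELECT COALESCE($1, $2);
$$ LANGUAGE sql;

SELECT first_non_null(NULL::int, 5);          -- returns 5
SELECT first_non_null('abc'::text, 'def');    -- returns abc
-- Mixing types fails, because every anyelement argument must resolve to the same type:
-- SELECT first_non_null(1, 'abc'::text);     -- ERROR: no matching function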


Continue reading "The wonders of Any Element"

Francisco Figueiredo Jr: ConnectionPool performance improvements



Hi, all!

Today I committed a change to Npgsql which will improve connection pool performance. This change was motivated by Andrew's bug report, where he noticed that a lot of threads were waiting to get a new connection from the pool.

In order to keep the pool consistent, Npgsql has to lock access to it. Andrew's problem appeared on a busy server where a lot of threads were trying to get a new connection from the pool. They had to wait in line, and obviously this isn't good.

The current implementation of Npgsql creates a big lock surrounding all the code needed to work with the pool and more! As Andrew noticed in his bug report, I/O operations were being done inside this lock, which contributed to further delays in getting a connection from the pool.

So, to fix that, I rewrote the connection pool logic to remove this big lock and break it down into smaller ones used only where really needed. All the I/O operations were left outside the locks; this way, other threads waiting to get a new connection from the pool don't have to wait for those expensive operations to finish.

I ran a lot of tests and confirmed that when I break into the code in the debugger, threads are spread throughout the connection pool code as expected instead of waiting in line on the big lock.

As this change is somewhat critical to Npgsql usage, I'd like to ask you to download the code, compile it, give it a try, and see if everything works as well as or even better than before. I expect busy servers to be able to increase their raw throughput because they will have to wait less to get connections from the pool.

As always, please, let me know if you have any problems and all feedback is very welcome!

Selena Deckelmann: Where to find me at #LCA2012


I’m going to be pretty busy while in Melbourne and Ballarat for the next 10 days.

Here’s my itinerary:

There’s a rumor that Stewart Smith and I might do a Q&A about databases in the cloud. If it happens, it will involve lots of pessimism and swearing.

Drop me a note if you want to meet up! I’ll be in Ballarat until early Friday morning.

Then I fly back to LA to give a keynote at SCaLE that Sunday (blog post about that coming).

Related posts:

  1. OSCON: Postgres represent! And my links for Harder, Better, Faster, Stronger talk
  2. Looking toward Chicago: Postgres Open, local user groups, parties and on to October!

David E. Wheeler: PGXN Has a New Home


Day before yesterday, I finally got all of PGXN moved to a new server. I had been using a small server owned by my company, Kineticode, and hosted by Command Prompt. That was fine for a while, but CMD was needing its rack space back, and what with my new job, I was shutting down Kineticode, too. It was time to move PGXN elsewhere.

For a while, I got a lot of support and assistance towards moving PGXN to a PostgreSQL community server. Dave, Magnus, and Stefan kindly spun up a VM for me, and gave me permission to install Perl modules from CPAN, provided I supply them with a script to report to Nagios when Perl modules were out of date, which of course I did. This was necessary because I built PGXN with some pretty recent versions of CPAN modules that are not yet available in Debian stable. I was looking forward to getting things running and integrating with the community authentication service.

I got the server built, and everything was working reasonably well. Magnus and I were just working out some issues with the proxy server configuration, and I was starting to think about how to migrate the data over. But first, I decided to refactor the Perl module script to use a more efficient implementation. I fired it off and piped its output to the cpan utility to just get everything updated. Unfortunately, unlike my first implementation, which reported only on CPAN-installed modules, this version of the script also reported when Debian-installed modules were out-of-date. And since I have my CPAN build configuration set up to remove previous installations, I upgraded all those modules, replacing them with new versions.

Well, this was a major fuckup on my part. Turns out there’s no simple way to restore Debian-distributed versions of the modules without rebuilding the entire system. Worse, this was exactly the sort of thing the community sysadmins feared. They have to maintain a lot of servers. So they naturally prefer that they all be as similar as possible. The new PGXN server had been mostly similar to what they had before, and Dave and company had been willing to compromise quite a bit to get PGXN going, but I, unfortunately, demonstrated how easy it is to ruin the whole thing.

So we decided that a community server isn’t the right place for PGXN. At least not yet. Perhaps in a year or two the Debian distribution will be updated to have all the prerequisites I need. Better yet, maybe someone will create a PGXN Debian distribution! (Volunteers welcomed.) Then I won’t have to do anything special and we can try again (without any sudo privileges for me!). But in the meantime, I still needed to move things.

Fortunately, depesz came to the rescue. He has a very nice box hosting his blog, explain.depesz.com, and a few other things, and would I like to set things up there? Depesz used perlbrew to set up a Perl install just for the PGXN system accounts, meaning I could install any Perl modules I needed without interfering with the system Perl. And each account has its own privileges to run the services it needs (Manager, API, Site) without the risk of breaking anything else. A few days after getting access, we had everything set up and ready to go. I pulled the trigger on Monday, and it went off without a hitch.

My thanks to depesz for the server and all the assistance, not to mention his donation! PGXN now has a very nice home where it can mature.

And as for the future, I have some thoughts about that, too.

  • I’d like to blog about the migration itself, and how easy it is (and isn’t) to build PGXN.
  • There are some bugs to be fixed and minor improvements to be had. Interested in helping out?
  • I’d love to hear your ideas about how to improve PGXN. What would make it better? What doesn’t work quite right for you now?

And yes, now that this migration is finally done, I expect I’ll have more time to blog and work on PGXN going forward. Please leave your thoughts and ideas in the comments. This thing is wide open to any kind of idea, and I would greatly appreciate your feedback.

Pavel Golub: Why you cannot create table and PK constraint with the same name


Today an interesting message appeared on the pgsql-bugs@postgresql.org list:

When I do this

CREATE TABLE "T1"
(
"T1_ID" bigint NOT NULL,
CONSTRAINT "T1" PRIMARY KEY ("T1_ID" )
);

I get the following message:

NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "T1" for table "T1"
ERROR: relation "T1" already exists
********** Error **********
ERROR: relation "T1" already exists
SQL state: 42P07

It does NOT create either the table or the constraint, and the message is confusing because there is no relation by that name.

The SQLSTATE 42P07 is described in the manual only as “table undefined”, and it is not clear if the intent is to allow or disallow the creation of a constraint called the same as the table in Postgresql. Oracle 11g allows this, but my feeling is that doing this should not be allowed, just as Postgresql handles it.

I am complaining about the confusing error message which IMO is off-topic, not about how the DB handles this.

The quick answer is that a PRIMARY KEY constraint always has an underlying system index with the same name. Thus, to execute the CREATE statement above, PostgreSQL would have to create a table named “T1” and an index with the same name. This is impossible, because tables and indexes are stored in the same system catalog, pg_class (they share the same namespace). That is where the ambiguity appears. The same is true for UNIQUE constraints.
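For example (our own illustration, not from the report), giving the constraint a different name works, and both the table and its implicit index then show up in pg_class:

CREATE TABLE "T1"
(
"T1_ID" bigint NOT NULL,
CONSTRAINT "T1_pkey" PRIMARY KEY ("T1_ID")
);

-- the table (relkind 'r') and the constraint's index (relkind 'i') share the namespace:
SELECT relname, relkind FROM pg_class WHERE relname IN ('T1', 'T1_pkey');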

On the other hand, you may freely create a CHECK constraint with the same name as the table:

CREATE TABLE "T1"
(
"T1_ID" bigint NOT NULL,
CONSTRAINT "T1" CHECK ("T1_ID" > 0 )
);



Robert Haas: Linux Memory Reporting

As much as I like Linux (and, really, I do: I ran Linux 0.99.something on my desktop in college, and wrote my class papers using vim and LaTeX), there are certain things about it that drive me crazy, and the way it reports memory usage is definitely on the list.  It should be possible for a reasonably intelligent human being (in which category I place myself) to answer simple questions about system memory usage, such as "How much memory is my database using?" or "How much memory is my web server using?" relatively simply.
Read more »

Keith: PostgreSQL Oracle FDW... in 8i?!


So one of our clients is still stuck running an old Oracle 8i database. Please, no comments on how there's no more support and they should have moved off this long ago. We know. They know. Moving on...

The introduction of a new Oracle Foreign Data Wrapper piqued our interest. Could it possibly still work with 8i? Working with our SA team, they found that the oldest client libraries still available from Oracle are 10.2. We're not exactly sure when Oracle dropped 8i support from their client libraries, so instead of experimenting at this time, they went with the known working client libraries matching our currently used client, which is 10.1.0.2.0 (going by the package info). So they compiled up some packages for our Solaris environment and off I went.

With the packages installed, setting up the extension couldn't have been easier:

CREATE EXTENSION oracle_fdw;

This was my first attempt at using Foreign Data Wrappers, so the next hour or so was spent reading the oracle_fdw docs and jumping around the postgres docs to see how it all works. We already have a connection between PostgreSQL and Oracle working with the dbi_link package, so the Oracle client connection was already set up and working (explaining that setup is a little out of scope for this blog post). The commands to create the server, user mapping & foreign table follow...

CREATE SERVER oracle_server
FOREIGN DATA WRAPPER oracle_fdw
OPTIONS (dbserver 'ORACLE_DBNAME');
CREATE USER MAPPING FOR CURRENT_USER
SERVER oracle_server
OPTIONS (user 'oracle_user', password '######');
CREATE FOREIGN TABLE keith.fdw_test (
    userid      numeric,
    username    text,
    email       text
    ) 
SERVER oracle_server
OPTIONS ( schema 'keith', table 'fdw_test');

Then run a select and see...

pgsql=# select * from keith.fdw_test;
 userid | username |       email       
--------+----------+-------------------
      1 | keith    | kei...@example.com
(1 row)

It works! This will make the (hopeful) migration off of Oracle 8i that much easier.

Could this possibly be faster than dbi_link for replicating data from Oracle to Postgres? I'll be working on rewriting some of our data replication functions to use the FDW and running comparisons. I'll share the results in a future post.


Chris Travers: Thoughts on what to put in the database

It seems if you ask three database developers what business logic should be put in the database, you will get (at least!) three answers.  Having read Andrew Dunstan's blog post on the subject, as well as Tony Marston's advocacy of putting no business logic in the database, I will give my viewpoint.  Here it is:
All logic pertaining to storing, manipulating stored data, retrieving, and presenting data in a relational format belongs in the database provided it can be reduced to atomic calls.
Really that's it.  If you want to send an email, don't do that in the database (there are a billion reasons why). If you want to generate HTML documents, the database would not be my first choice of where to put it.  There are a couple other ways to look at this though.  Here is a previous version of my view and why I have changed my mind, slightly narrowing the field:

There is a difference between business logic inherent in the data and business logic regarding what you do with the data. Inherent logic belongs in the database.  Use-based logic belongs in the application.
This doesn't quite work in practice as well as it works in theory because it is a bit fuzzy and arguably overinclusive.  Things like converting date formats for example could be argued to be inherent logic, but flooding the database with round-trip calls for this would not be an efficient use of resources.  Consequently this view only works when narrowly interpreted, and where data format instability is inherently seen as use-based logic.   In other words, we end up getting back to something like my current position.

The second issue is that this view is slightly underinclusive.  If I want to ensure that an email is always sent when certain criteria are met, that logic has to be tied to the database, and there is a correct way to do this in PostgreSQL (see below).  The actual email would not be sent from the database, but information regarding sending it would be stored and made available on transaction commit.

Remember the goal here is to have an intelligent database which exists at the center of the application environment, providing data services in a flexible yet consistent way.

Back in the LedgerSMB 1.2 days, I wrote a sample utility that would listen for updates and, when a part dropped below its reorder point, send out an email to a specified individual.  This is the correct approach for a couple of reasons, but the utility still had significant room for improvement.  There are two specific problems with the approach taken here:

  1. The trigger was supplied as part of the main database setup scripts which meant that since the trigger logic was incomplete (see below), the utility would necessarily send out duplicate information on each email.   Such utilities should actually be self-contained and supply their own triggers.
  2. The trigger was inadequate, simply raising a NOTIFY when a part became short.  The trigger should have taken the information from the new row and inserted it into a holding table which the email script could then clear.
So if we look at the way something like this should function transactionally, we get something like this:
  1. Invoice issued; fewer parts on hand than the reorder point. Take a summary of the warning information and store it in a queue table for email.
  2. Raise a notification.
  3. Invoice transaction commits, making the notification visible to the listening application.
  4. Listening application receives the notification and checks the queue table.
  5. Prepares and sends the email, deleting records from the queue table as they are processed.
  6. Checks again for new records, commits its transaction, and checks for new notifications.
This means you have two asynchronous loops going on, coordinated by transactional controls on the part of the database. The invoice could also be queued, and another loop could be used to generate a printed document to be printed automatically and sent out to the customer. All manner of other things could be automated in such a way that other scripts which import invoices could do so without interrupting any of these loops. Moreover these can be added without disturbing the original application, simplifying testing and QA.
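To make the shape of this concrete, here is a minimal sketch of the queue table and trigger; the table, column, trigger and channel names are illustrative, not taken from LedgerSMB:

CREATE TABLE reorder_email_queue (
    id        serial PRIMARY KEY,
    part_id   integer NOT NULL,
    onhand    numeric NOT NULL,
    rop       numeric NOT NULL,
    queued_at timestamp NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION queue_reorder_warning() RETURNS trigger AS $$
BEGIN
    IF NEW.onhand < NEW.rop THEN
        -- store a summary so the mailer does not have to re-derive it
        INSERT INTO reorder_email_queue (part_id, onhand, rop)
             VALUES (NEW.id, NEW.onhand, NEW.rop);
        -- the notification only becomes visible to listeners on COMMIT
        NOTIFY reorder_warning;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER parts_reorder_check
    AFTER UPDATE OF onhand ON parts
    FOR EACH ROW EXECUTE PROCEDURE queue_reorder_warning();

The external mailer runs LISTEN reorder_warning and, on each notification, reads and deletes rows from reorder_email_queue inside its own transaction.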

Joe Abbate: The Phantom of the Database – Part 4


At the end of November, I finished the third episode with mild suspense: I suggested that the problem of optimistic “locking” could perhaps be solved in PostgreSQL with something other than extra qualifications, row version numbers or timestamps.

Let’s start this episode with action!

moviesdb=> INSERT INTO film VALUES (47478, 'Seven  Samurai', 1956);
INSERT 0 1
moviesdb=> SELECT xmin, xmax, ctid, title, release_year FROM film;
  xmin  | xmax | ctid  |     title      | release_year
--------+------+-------+----------------+--------------
 853969 |    0 | (0,1) | Seven  Samurai |         1956
(1 row)

moviesdb=> UPDATE film SET title = 'Seven Samurai' WHERE id = 47478;
UPDATE 1
moviesdb=> SELECT xmin, xmax, ctid, title, release_year FROM film;
  xmin  | xmax | ctid  |     title     | release_year
--------+------+-------+---------------+--------------
 853970 |    0 | (0,2) | Seven Samurai |         1956
(1 row)

moviesdb=> UPDATE film SET title = 'Sichinin Samurai' WHERE id = 47478;
UPDATE 1
moviesdb=> SELECT xmin, xmax, ctid, title, release_year FROM film;
  xmin  | xmax | ctid  |      title       | release_year
--------+------+-------+------------------+--------------
 853971 |    0 | (0,3) | Sichinin Samurai |         1956
(1 row)

moviesdb=> UPDATE film SET release_year = 1954 WHERE id = 47478;
UPDATE 1
moviesdb=> SELECT xmin, xmax, ctid, title, release_year FROM film;
  xmin  | xmax | ctid  |      title       | release_year
--------+------+-------+------------------+--------------
 853972 |    0 | (0,4) | Sichinin Samurai |         1954
(1 row)

moviesdb=> VACUUM FULL film;
VACUUM
moviesdb=> SELECT xmin, xmax, ctid, title, release_year FROM film;
  xmin  | xmax | ctid  |      title       | release_year
--------+------+-------+------------------+--------------
 853972 |    0 | (0,1) | Sichinin Samurai |         1954
(1 row)

moviesdb=> BEGIN; DELETE FROM film WHERE id = 47478; ROLLBACK;
BEGIN
DELETE 1
ROLLBACK
moviesdb=> SELECT xmin, xmax, ctid, title, release_year FROM film;
  xmin  |  xmax  | ctid  |      title       | release_year
--------+--------+-------+------------------+--------------
 853972 | 853974 | (0,1) | Sichinin Samurai |         1954
(1 row)

moviesdb=> BEGIN; UPDATE film SET release_year = 1956 WHERE id = 47478;
BEGIN
UPDATE 1
moviesdb=> SELECT xmin, xmax, ctid, title, release_year FROM film;
  xmin  | xmax | ctid  |      title       | release_year
--------+------+-------+------------------+--------------
 853975 |    0 | (0,2) | Sichinin Samurai |         1956
(1 row)

moviesdb=> ROLLBACK;
ROLLBACK
moviesdb=> SELECT xmin, xmax, ctid, title, release_year FROM film;
  xmin  |  xmax  | ctid  |      title       | release_year
--------+--------+-------+------------------+--------------
 853972 | 853975 | (0,1) | Sichinin Samurai |         1954
(1 row)

What element in the queries above could be used as a surrogate “row version identifier?”  If you examine the changes carefully, you’ll notice that the xmin system column provides that capability. The ctid, a row locator, on the other hand, does not survive the VACUUM operation, and xmax is only used when a row is deleted (or updated, causing it to move).

So my suggestion, in terms of web user interfaces, is that fetching a row for a possible update should include the xmin value, e.g., in the get() method of the Film class, use the following:

    def get(self, db):
        try:
            row = db.fetchone(
                "SELECT xmin, title, release_year FROM film WHERE id = %s",
                (self.id,))

The xmin value can then be sent as a hidden field to the web client, and used in the update() method to implement optimistic concurrency control.
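The update itself can then pass the remembered xmin back, for example (a sketch of ours, not from the article):

UPDATE film
   SET title = 'Seven Samurai', release_year = 1954
 WHERE id = 47478
   AND xmin = 853972;   -- the xmin value fetched earlier
-- UPDATE 0 means another transaction changed the row in the meantime,
-- and the application can report the conflict to the user.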



Phil Sorber: Deploy Schemata Like a Boss

One of the many new features in Postgres 9.1 is Extensions. In their simplest form they are a collection of database objects. Think of it as package management for your database. It lets you add, remove and upgrade a collection of objects. All contrib modules shipped with Postgres now use this method.

This is just the beginning though. Normally when you think of contrib modules for postgres you think of some new functions or data types. With extensions you can just as easily manage relations. Create a table in one version of the extension and then add a column in another version. Be sure which versions of your functions and relations exist on a database by managing them as a group instead of as individuals. You can break up your schema into logical chunks and give responsibility of each chunk to an appropriate team in your development group to distribute the workload.


One extension usually has multiple versions. The version identifiers don't need to be numeric. It doesn't understand the concept that version 2 might be newer than version 1. So rather than upgrading, you are merely transitioning between versions. You can think of this like a directed graph. Each node is a version and each edge is an update path. You can also chain paths if there is no direct route. In the example to the right, there is no direct path from β → γ but you can go β → δ → γ. Below you can see all the permutations listed from the database.


example=# SELECT * FROM pg_extension_update_paths('myext') ORDER BY source,target;
 source | target |    path  
--------+--------+------------
 α      | β      | α--β
 α      | γ      | α--δ--γ
 α      | δ      | α--δ
 β      | α      | β--δ--γ--α
 β      | γ      | β--δ--γ
 β      | δ      | β--δ
 γ      | α      | γ--α
 γ      | β      | γ--β
 γ      | δ      | γ--α--δ
 δ      | α      | δ--γ--α
 δ      | β      | δ--γ--β
 δ      | γ      | δ--γ
(12 rows)
There is also a special source called 'unpackaged' that you can use to take an existing schema and put it into an extension. It is simply a script that takes every object in your database and adds it to your extension. This way you don't have to do a big migration to get your schema working out of an extension.
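Assuming you have written such a script (for example myext--unpackaged--β.sql, consisting of ALTER EXTENSION myext ADD ... statements for each existing object), it is invoked with the FROM clause of CREATE EXTENSION:

CREATE EXTENSION myext FROM unpackaged;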

So you've decided you want to try this? Great! Here is what you will need:

A control file (<extname>.control)
default_version = 'β'
comment = 'my extension'
relocatable = true
default_version is the version that will be installed unless otherwise specified. relocatable means that it can be put in an arbitrarily named schema and also move from one schema to another. The only reason that wouldn't be true is if you explicitly reference a schema name in your scripts. There are some other settings you can use as well, such as requires which adds a dependency to another extension. For example if your extension requires the pgcrypto extension.

A Makefile:
EXTENSION = myext
DATA = myext--α.sql myext--α--β.sql myext--α--δ.sql myext--β.sql myext--β--δ.sql myext--γ.sql myext--γ--α.sql myext--γ--β.sql myext--δ.sql myext--δ--γ.sql
PG_CONFIG = /usr/pgsql-9.1/bin/pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
EXTENSION is the name of your extension. DATA is the list of all the install and update SQL scripts. Every time you add a new version you need to append to this line. PG_CONFIG is the path to your pg_config binary for the install you want to use. The last two lines are just standard Postgres Makefile code for pulling in all the install specific information it needs.

The other pieces you need are the SQL scripts to do all the work. The install scripts (<extname>--<version>.sql) are basically just a bunch of create statements. Something like you would get from a pg_dump. They create all the objects for a particular version. The update scripts (<extname>--<source>--<target>.sql) consist of a combination of create, alter and drop statements. This is all the SQL that you would need to modify a schema of the source version to convert it to a schema of the target version. This will usually be a create table or alter table or a create or replace function. If you need to remove an object you will first need to disassociate it from the extension.
ALTER EXTENSION myext DROP TABLE foo;
Then you can run a standard drop table. If you want to be able to "update" to the previous version, you will need a separate script for that as well. I would suggest not trying to make a graph as complex as the one above. Most likely yours will be completely linear.
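To give a feel for what these scripts contain, here is a minimal illustrative pair (the table and function are made up, not from this post):

-- myext--α.sql: install script for version α
CREATE TABLE foo (
    id   serial PRIMARY KEY,
    name text NOT NULL
);

-- myext--α--β.sql: update script from α to β
ALTER TABLE foo ADD COLUMN created_at timestamptz DEFAULT now();
CREATE OR REPLACE FUNCTION foo_count() RETURNS bigint
    AS 'SELECT count(*) FROM foo' LANGUAGE sql;

Objects created by these scripts automatically become members of the extension; there is no need to add them explicitly.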

So how easy is it to install the extension? Run make install in your extension directory, then:
example=# CREATE SCHEMA myext;
CREATE SCHEMA
example=# CREATE EXTENSION myext WITH SCHEMA myext;
CREATE EXTENSION
That easy. And to update?
example=# ALTER EXTENSION myext UPDATE TO 'γ';
ALTER EXTENSION
You can use \dx to list installed extensions:
example=# \dx
                 List of installed extensions
  Name   | Version |   Schema   |         Description        
---------+---------+------------+------------------------------
 myext   | γ       | myext      | my extension
 plpgsql | 1.0     | pg_catalog | PL/pgSQL procedural language
(2 rows)
As you can see from this brief introduction this is pretty powerful stuff. It's still a little rough around the edges, but it will be getting better. I recommend taking a look at the documentation to get a more complete understanding of all this can do.

Andreas Scherbaum: Schedule for PostgreSQL devroom at FOSDEM 2012

Chris Travers: Further Reply to Tony Marston

Tony Marston has updated his piece about unintelligent databases to include information from my earlier reply, and another piece entitled Business Logic in the Database.  The reply and his comments in the latter get into some of the reasons why his use case is in fact narrow and therefore are very much worth discussing here.  Additionally some other viewpoints will be discussed from the coverage of the debate in the previously mentioned posts.

It is no surprise that Tony continues to take issue with the idea of putting business logic in the database and in fact expands that to include things like foreign keys and check constraints.  However many of the points are actually interesting and worth discussing.  My own sense is that we can relegate Tony's viewpoint largely to cases involving software as a service.  As a bonus, I will also respond to his points in his article Stored Procedures are Evil.


Marston's New Points in His Reply
 
That has not been my experience. I have spent all of my career in writing applications which have sole access to their databases. Having multiple applications trying to perform the same function on the same database is something which I have never encountered.

Marston is right that this is usually not on the requirement sheet anywhere, though sometimes it is.  Rather additional applications are usually added after the fact, for example to move data from one application to another, to generate reports, or even to do other things. 

One of my customers uses LedgerSMB in a fairly automated way, and has all financial information imported through a staging schema from production applications, as well as a second application which creates and manages vendor profiles and payment information (and which eventually imports the data into LedgerSMB).  They are an exception though.  Most customers who end up with multi-app databases don't set out to build them.  Instead they have additional requirements added after the fact where new applications are clearly the right approaches.  By removing the logic which prevents invalid data from being stored in the database, Marston's approach slams shut the door on such after-market development.

I disagree completely. It is implementing business logic in the database which is more time consuming, difficult and expensive. And please don't muddy the waters with NoSQL databases as they are a recent phenomena which have yet to reach full maturity and live up to their hype.

The discussion about NoSQL holds true for other schemaless designs.  If all you are using the RDBMS for is a very light-weight data store with some backup capabilities, why not use a schemaless database?

NoSQL engines in fact bring few things to the table compared to prerelational engines which are relevant to this discussion.  So in this regard, they are not significantly different from any other key/value store, for example Berkeley Database, which I think we can all agree has reached full maturity.

The tradeoff is quite simple really.  If all you are doing is app development with no reporting, it's far faster development-wise to just throw your data into key/value stores (whether pre-relational or post-relational) than it is to deal with the overhead of relational design.  On the other hand, you lose the ability to do data constraints, ensure all stored data is meaningful, and any reasonable ad-hoc reporting.    On the whole, the losses for a business application usually outweigh the gains for reasons based on the very features Tony Marston doesn't appear to want to use (foreign keys, check constraints, views, reporting.....) and therefore the relational model wins out more times than not.

Now, my guess is that since Marston repeatedly says he doesn't do reporting but has an ERP, he is specifically talking about ad hoc reporting.  Any ERP worth its salt relies heavily on reporting generally.  So here we begin to see the narrow use case that Marston (and admittedly some big ERPs too) falls into, though I think a lot of the reason for this is licensing revenue.

Most large ERP's that I am aware of do in fact require that all access to the db goes through a middleware layer.  I think there are two fundamental reasons for this.  The first is that most of these depend on large, expensive, proprietary databases, and since most businesses are more interested in consolidating servers, it is hard to insist that customers, for example, buy SQL Server instead of Oracle, or DB2 instead of SQL Server.  This means that there is a requirement to use the lowest common denominator in the RDBMS set.

The second reason is that the ERP can only enforce client access license limits on the middle tier--- it cannot do this effectively at the database level.  Therefore in order to enforce client access license payments, this is usually enforced here, and no writes can be allowed beyond the ERP application itself.   The goal of course is to build an ERP framework that is at the center of the business application environment and therefore the ERP vendor is most heavily in control of the license fees that they will collect.

That neither of these apply to LedgerSMB is why we can afford to move away from this model and aggressively re-use what frameworks are provided by the database and other components.

You misunderstand me. When I said "dumb" data store I meant the basic properties of any DBMS (whether relational or not) which is the ability to specify tables, columns within tables with data types and sizes, primary keys, unique keys and indexes. More advanced features, such as foreign key constraints, column constraints, triggers, stored procedures and user-defined functions were later additions. It is the use of these "advanced" features which I regard as optional and therefore I have the right to exercise that option and not use them.

Um......  Where to begin......

First, tables and columns are what define the relational model.  Non-relational databases use other ways of representing data.  Primary keys are also specific to the relational model, as, I would think, are unique keys.  That means that of the features listed for all DBMSs ("whether relational or not") only the most general (indexes) are applicable outside the relational world, and even that term is so vague that it isn't clear to me that it is meaningful outside of a more specific context.

Berkeley Database, for example, is a basic key/value store.  You can store whatever you like as a value, and it is referenced by its key.  You could store JSON, XML, YAML... whatever.   No tables, columns, unique constraints, etc.  You can just serialize your objects and put them there.

The idea that foreign key constraints are an advanced feature that should not be used would be laughable were it not so dangerous.  I have seen first hand what the refusal to declare foreign keys can do in the context of an ERP application.  I suppose it is great if you want to make money off of untangling people's data.   It is very bad though if you want people to trust the application to give the right results from an accounting perspective.    I remember a panicked call from a customer one April a few years before we forked LedgerSMB, "It's getting close to tax time and our books don't balance.  We need this fixed right now.  Can you help?"  How long had the books been out of balance?  About 8 months.  Ouch, that's gotta hurt......

One of the big developments in LedgerSMB 1.2 has been the addition of foreign key constraints, and these were further improved upon in 1.3.  Why?  Because we take data integrity seriously......

 I am most definitely *NOT* advocating a return to the COBOL/VSAM/IMS paradigm. I am simply questioning the argument that because you *CAN* implement business rules in the database then you *MUST*. I don't have to, and I choose not to.

The question of course is one of tradeoffs.  How important is your data?  How sure are you that folks will not want to create other import/export routines hitting the database?  How confident are you in your code?

The more you can leverage declarative constraints at all levels of your application, the more sure you can be, mathematically, that your data is meaningful.  It really is that simple.

Marston's Points (And Responses to them) in His Comments On Business Logic in the Database

In the blog entry mentioned above, "Business Logic in the Database", I think the comments effectively narrow the use case Tony is talking about to those where either revenue from middleware licenses is an issue or where something equivalent is going on:
Luck has nothing to do with it. Nobody is allowed direct access to my databases without my permission, and I never grant that permission. Access is either through my application or through my web services.
Now, thinking for a moment about what "my databases" must mean here, this can only mean software as a service.  Here the customer does not control their own data, which is a compelling reason to switch to LedgerSMB ;-).   In particular this sort of control ends for all intents and purposes the moment the application is deployed at a customer's site.  Therefore this application either requires license agreements, or it requires control over the database servers.

Either way, it has no place in open source.

To this, a commenter named "Thomas" replied:
While I realize you think your systems are confined to only being accessed by your applications, in the 25 years I’ve worked with databases, I’ve never encountered of a situation where such a restriction could be enforced indefinitely.
That matches my experience too.  In fact I would say that I have never found an application deployed in production in a standard RDBMS at a customer's site where the customer found this restriction to be valuable.  If it can be enforced indefinitely it is only through legal agreements requiring it, and those are burdens....

The fundamental problem, as I have repeatedly noted in this discussion, is that the amount of logic that must be included in the database goes way up as soon as you want to support even the possibility of that second application.  Yes, that means at a minimum, check constraints, referential integrity constraints, domains, and the like.  Depending on your application, you might certainly have to go to stored procedures just to get an assurance of data integrity.  Certainly that is the case when trying to enforce a rule like "all GL transactions must be balanced."
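As an illustration of that last rule, here is a sketch with made-up table and column names (not LedgerSMB's actual schema) using a deferred constraint trigger so the check runs at commit, after all lines of the transaction are in place:

CREATE OR REPLACE FUNCTION check_gl_balanced() RETURNS trigger AS $$
BEGIN
    IF (SELECT coalesce(sum(amount), 0)
          FROM gl_lines WHERE trans_id = NEW.trans_id) <> 0 THEN
        RAISE EXCEPTION 'GL transaction % does not balance', NEW.trans_id;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE CONSTRAINT TRIGGER gl_must_balance
    AFTER INSERT OR UPDATE ON gl_lines
    DEFERRABLE INITIALLY DEFERRED
    FOR EACH ROW EXECUTE PROCEDURE check_gl_balanced();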



A Diversion into History:  Goths, Marston, and Anti-Intellectualism


Robert Young commented:
Considering that you’re championing the failed COBOL/VSAM/IMS paradigm that Dr. Codd put a stake in, why aren’t you the one making such comments? The Goths put Europe into the Dark Ages by enforcing the pre-intelligence paradigm. Those, such as yourself, seek to return to the thrilling days of yesteryear, where each application was its own silo, and data sat in dumb files accessible only through bespoke code. Kind of the Dark Ages. Have a look at WebSocket, and see where the rest of us are going. No more disconnected client edit screens; just call the validation from the database from any client application. Complete flexibility, on both the server and client side. Separation of responsibility. And all that nice rhetoric that OO folks espouse, but seldom perform. You’re not “reinventing the wheel”, just rediscovering the square one that your grandfather used. That’s not progress.
I recognize it is unusual to see a history buff of my sort in application development but.....

Robert Young is wrong here on most counts, but not quite all.  Most particularly, he is wrong about the Goths.  The two most important books to read on the Goths are probably Peter Heather's book in the Peoples of Europe series and Herwig Wolfram's "The Roman Empire and Its Germanic Peoples."  Both these books provide a view of the Goths which is well outside the standard dark ages narrative that Robert Young describes.  In fact both authors tend to stress the continuity of Roman customs, land titles, and administration when the Goths ruled Italy and Spain.  Sure there was some (necessary) simplification of the laws and so forth, but the decay of Italy continued, as Wolfram notes, because the concentration of wealth was so high that it prevented effective taxation, and thus Ostrogothic Italy eventually fell to the Byzantines.  Wolfram also attributes the fall of the Western Empire to the same forces he claims doomed the Goths.

In Arms and Armor of the Medieval Knight, Miles and Paddock suggest that the Migration Ages were times when the Germanic peoples spread important metallurgical technologies across Europe, including the all-important technique of pattern-welding different forms of steel into weapons of far greater quality than the Romans managed.  (The next major advancement in metallurgy would have to wait until the conversion of Scandinavia allowed  Christians to import technologies for manufacture of homogeneous steel swords and other tools from Northern Europe.)

In fact, it's pretty clear, given that the Goths arose from within the Germanic world and that at least one Gothic rebellion sought to obtain a Roman generalship for their leader, that the Goths were far less anti-Roman than the Romans were anti-Goth.....  A lot of this had to do with the Goths' early conversion to a form of Christianity known as Arianism, after its founder, Arius of Alexandria.  This led to several centuries of religious feuding between Nicene and Arian Christianity which is quite clear from every author of anywhere near that time.

And so we have a partial rebuttal of the master narrative that the great, progressive, scientific Roman civilization was destroyed by the backwards, barbaric, superstitious Goths.

But narratives matter including in this discussion.  Just as the history is wrong, I think Young's central point is wrong too.  I have come to the conclusion that conservatism in design choices is a good thing, and that a post-modern approach means valuing even approaches that seem at the moment to be outmoded.  The real question is not which method is best, but rather in which use cases one method or another wins out.

In other words, to me there is no progress of methodologies, innovation is a dirty word, and we should be sceptical of new, overly hyped technologies, looking at them through a lens of what has been learned in the past.  From this viewpoint, I have to say that I think that accusing people of getting in the way of progress is actually giving them a compliment, for in the words of Henry Spencer, "Those who do not learn from UNIX are destined to reinvent it badly."

Where Mr. Young is right is in the separation of responsibility point.  This point cannot be overemphasized.  The idea that a database takes on responsibility for storing meaningful data is something which no database which is both non-relational and non-dedicated can manage.  The only use cases where one can avoid this are things like LDAP (which is a horror for reasons I won't go into here).  And therefore relational databases usually win out, again, for reasons pertaining to the features Marston doesn't use.

Marston's Points in "Stored Procedures Are Evil"


In this article, Marston oversimplifies the positions of where/when to use stored procedures to two very extreme positions, entirely erasing any middle:

Use stored procedures and triggers only when it is an absolutely necessity.

vs.
Use stored procedures and triggers at every possible opportunity simply because you can.

He argues that his knowledge is better because:
You only know what you have been taught, whereas I know what I have learned.
Well, I used to agree with Marston.  I have since come to learn that he is wrong.  Perhaps it is this change of opinion which qualifies me to look at his arguments and address them, acknowledging where he has a point.

Before I start, though, I will say that people who use stored procedures at every possible opportunity (and I have met a few) do cause problems and that such is not a viable position.  The obvious middle ground is:
Use stored procedures wherever they make sense given the functions of the database
 But that may be too nuanced for Marston.....  Presumably we should look at his arguments against stored procedures  and it will make more sense.

First though let's look at what he says about arguments in favor of stored procedures:

This is a common argument that many people echo without realising that it became defunct when role-based security was made available. A good DBA defines user-roles in the database, and users are added to those roles and rights are defined per role, not per user. This way, it is easy to control which users can insert / update and which users can for example select or delete or have access to views in an easy way.

There are certain cases where role-based security doesn't quite get you there.   Now these may not impact all applications but they do affect a significant set of them.  The big one is something like "users of role x must be given the permission to mark transactions as approved, but cannot mark approved transactions as unapproved."  Because the security setting depends on the value of the input, you cannot enforce the security using database role-based security.  In these cases, the only secure way to grant access is through a stored procedure.
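A minimal sketch of that kind of value-dependent rule (table, column and role names are illustrative):

CREATE OR REPLACE FUNCTION approve_transaction(in_id integer) RETURNS void AS $$
BEGIN
    UPDATE transactions
       SET approved = true
     WHERE id = in_id
       AND approved = false;   -- may approve, may never un-approve
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

REVOKE UPDATE ON transactions FROM approver_role;
GRANT EXECUTE ON FUNCTION approve_transaction(integer) TO approver_role;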

He also says:
With a view it is possible to control which data is accessed on a column basis or row basis. This means that if you want user U to select only 2 or so columns from a table, you can give that user access to a view, not the underlying table. The same goes for rows in one or more tables. Create a view which shows those rows, filtering out others. Give access rights to the view, not the table, obviously using user-roles. This way you can limit access to sensitive data without having to compromise your programming model because you have to move to stored procedures.
The problem here is that with update permissions, views suffer from all the portability problems of stored procedures.   So this doesn't necessarily provide a significant win.
It is also said that stored procedures are more secure because they prevent SQL injection attacks. This argument is false for the simple reason that it is possible to have a stored procedure which concatenates strings together and therefore open itself up to sql injection attacks (generally seen in systems which use procedures and have to offer some sort of general search routine), while the use of parameterized queries removes this vulnerability as no value can end up as being part of the actually query string.
First, it's important to note that stored procedures get called through a SQL query like anything else, and therefore the underlying SQL query is still vulnerable to SQL injection attacks and these need to be addressed.  So parameterized queries still have to be used.  Similarly, Marston has a point about dynamic SQL in stored procedures (a problem which is actually more severe in some cases than he gives credit for).

This being said, it is simpler to understand where the potential issues are in stored procedures, and it is simpler to audit them for problems than it is to deal with either dynamic SQL or generated SQL based on the sorts of mappers that Marston suggests.

As for performance, Marston states:

The execution of SQL statements in stored procedures may have been faster than with dynamic SQL in the early days of database systems, but that advantage has all but disappeared in the current versions. In some cases a stored procedure may even be slower than dynamic SQL, so this argument is as dead as a Dodo.

My general experience is that the more dynamic the SQL is, the harder it is to tune, because the more abstraction layers one has to go through in order to find the problems.  ORMs are usually horrible in this regard because they typically do a very large number of operations when a smaller number would suffice, and this is usually not fixable.   Similarly, dynamic SQL, where the whole query is created at run-time, imposes additional overhead when trying to locate and correct performance issues.

I have seen horrible performance from stored procedures, but I have seen how much easier they are to tune (when written in a maintainable way).

One thing to keep in mind though is that stored procedures are not necessarily more or less efficient.  They do take skill and care to write so that they perform well under demanding cases.  There are some additional gotchas to be aware of especially when folks who are more app authors than SQL authors start trying to write them.  However on the whole, I think this problem is manageable and the performance gains worth it.

He goes on to say:

Performance should not be the first question. My belief is that most of the time you should focus on writing maintainable code. Then use a profiler to identify hot spots and then replace only those hot spots with faster but less clear code. The main reason to do this is because in most systems only a very small proportion of the code is actually performance critical, and it's much easier to improve the performance of well factored maintainable code.

I completely agree with this btw.  Clear code is less costly to maintain than is unclear code.  And I have read horribly coded stored procedures just as I have read beautifully coded ones.  Focus on clear, maintainable code.  Keep things simple.

BTW this brings me to one clear advantage of stored procedures:  you can keep all your SQL code in different files from your higher tier programming logic.  This can be used to really help with maintainability.  Additionally I recommend the following for people writing stored procedures: "Use stored procedures, as much as possible, as named queries.  Keep them, as much as you can, to single SQL statements with named parameters."
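A sketch of what that looks like in practice (illustrative table and function names):

CREATE OR REPLACE FUNCTION customer__search(in_name text)
RETURNS SETOF customers AS $$
    SELECT * FROM customers WHERE name ILIKE '%' || $1 || '%';
$$ LANGUAGE sql STABLE;

The application layer then calls the named query and never assembles SQL strings of its own.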

His arguments against stored procedures become more interesting:

Instead of having a structure which separates concerns in a tried and trusted way - GUI, business logic and storage - you now have logic intermingling with storage, and logic on multiple tiers within the architecture. This causes potential headaches down the road if that logic has to change.

This intermingling goes both ways.  A clear stored procedure architecture, using stored procedures where they make sense, can have the effect of untangling your database queries from your application code, and thus making a multi-tiered architecture clearer and cleaner.  Of course, this isn't always the case, so common sense needs to be applied.  Again, use stored procedures where they make sense and simplify things, but don't use them just because you can, and don't avoid them just because you can.

He then goes on to talk about the maintenance issues of stored procedures:
The reason for this is that stored procedures form an API by themselves. Changing an API is not that good, it will break a lot of code in some situations. Adding new functionality or new procedures is the "best" way to extend an existing API. A set of stored procedures is no different. This means that when a table changes, or behaviour of a stored procedure changes and it requires a new parameter, a new stored procedure has to be added. This might sound like a minor problem but it isn't, especially when your system is already large and has run for some time. Every system developed runs the risk of becoming a legacy system that has to be maintained for several years. This takes a lot of time, because the communication between the developer(s) who maintain/write the stored procedures and the developer(s) who write the DAL/BL code has to be intense: a new stored procedure will be saved fine, however it will not be called correctly until the DAL code is altered. When you have Dynamic SQL in your BL at your hands, it's not a problem. You change the code there, create a different filter, whatever you like and whatever fits the functionality to implement.

I don't understand this one at all.  If the stored procedures are an API and extending them breaks things, why aren't the tables an API and extending them breaks things?  Of course, Marston talks about his own approach here which does dynamic discovery.  But in that case.....  LedgerSMB does dynamic discovery on stored procedures.  So it isn't clear to me that this is something that is specific to one side or the other.

On to the next one:
Business logic in stored procedures is more work to test than the corresponding logic in the application. Referential integrity will often force you to setup a lot of other data just to be able to insert the data you need for a test (unless you're working in a legacy database without any foreign key constraints). Stored procedures are inherently procedural in nature, and hence harder to create isolated tests and prone to code duplication. Another consideration, and this matters a great deal in a sizable application, is that any automated test that hits the database is slower than a test that runs inside of the application. Slow tests lead to longer feedback cycles.
I haven't found that to be true at all.  In fact coming up with good unit tests for database-driven applications usually means some sort of sample data set anyway.  A very clear win additionally is the ability to run tests on production systems in transactions that are guaranteed to roll back.  Testing this on a live system via application logic is a lot more dangerous, and more likely to insert test data into the database by mistake.

If all the business logic is held in the database instead of the application then the database becomes the bottleneck. Once the load starts increasing the performance starts dropping. With business logic in the application it is easy to scale up simply by adding another processor or two, but that option is not readily available if all that logic is held in the database.
 This is not necessarily the case.  If you put everything in stored procedures that the database would already be doing, there is no reason to think the db becomes more of a bottleneck than it would have been before.

This is a big issue if you want an application where the customer can insert their own business logic, or where different logic is required by different customers. Achieving this with application code is a piece of cake, but with database logic it is a can of worms.
 He speaks authoritatively here of something he professes at the beginning of the article not to really know well.  The fact here is that adding custom logic is *different* in a stored-procedure-centered application than it is in a business application.  There are ways of doing this right and managing the problem, just as there are ways of doing it on the application side incorrectly and making a mess.

The next point made me laugh:
A big problem with database triggers is that the application does not know that they exist, therefore does not know whether they have run or not. This became a serious issue in one application (not written by me) which I was maintaining. A new DBA who was not aware of the existence of all these triggers did something which deactivated every trigger on the main database. The triggers were still there, they had not been deleted, but they had been turned off so did not fire and do what they were supposed to do. This mistake took several hours to spot and several days to fix.
What competent DBA disables triggers without figuring out what they do first?  However, for all of that he does have a point.  Triggers can be used incorrectly to ensure that an application operation has a desired side effect.  This is not a very maintainable way to go.  It is better generally to use triggers to hook into a host application as a means of either enforcing RI, or allowing a third party application to force side effects that it needs.  The main application should not depend on the use of triggers to run effectively.  Only other applications should depend on triggers they provide for information as to what is going on in the host application, or as additional safety measures aimed at protecting data integrity.

As for his points about version control, all I can say is "test cases, test cases, test cases, all in transactions that roll back."  You might not know which version your stored procedures are, but you WILL know that they are behaving the way you want them to.

Finally to the point of vendor lock-in.

Here Marston has a clear point.  If you build a stored-procedure-centric application it will be difficult (though not necessarily impossible depending on various factors) to migrate to another database.  Of course, migrating applications between databases is not necessarily a walk in the park either (Oracle empty string handling comes to mind).

In general, if you are writing on an Oracle db, it is feasible to migrate stored procedures  perhaps to Fyracle and PostgreSQL, but less feasible to migrate to MS SQL.  If you are writing Java stored procedures in Oracle, then you can migrate these to PostgreSQL without too much difficulty.  If you are writing PL/Perl stored procedures in PostgreSQL, you will probably never be able to migrate them anywhere.  On the other hand if you are writing an application to ship to customers and it must run on Oracle, DB2, MS SQL, and PostgreSQL, then stored procedures are out.  In general, this comes back to "know your technology and choose it appropriately."

In conclusion, there are certainly cases where you don't want to use stored procedures, for example, because you don't want to be tied to a single RDBMS.  However this is no reason to have a blanket avoidance strategy any more than it is a good idea to put everything possible inside the db.

Marston concludes by saying that stored procedures are optional and therefore his view is not clearly wrong, but calling something optional and calling something evil are very different.  I think Marston is clearly wrong with the "evil" categorization, but generally right with the optional one.  For his specific use case, stored procedures may not work very well, and hence might be avoided.  However, generalizing that out is dangerous.

Edwin Knuth: making seatmate aware of locations with postgis


Now that I’ve got the comment view more or less working, it’s time to link it up with actual trimet routes.  It seems like we need an initial screen to show to users when they first bring up the app.  I imagine that people will either be riding the bus/max, or they will be waiting at the stop.  Seems like we need to start by showing routes that are closest to the user’s current location.

start screen for seatmate app

performing spatial queries in postgis

Our backend needs to support a method that takes a latitude and longitude in the commonly used WGS84 projection and returns the routes that are closest to that point.  Just getting the nearest stops isn’t useful because the rider may actually be travelling the route and not anywhere near a stop.  Luckily postgis makes it easy to select lines based on proximity to a point.

SELECT "RTE" AS route, "RTE_DESC" AS description, "DIR_DESC" AS direction,
       distance(PointFromText('POINT(-122.613639 45.499541)', 4326), the_geom) AS distance
  FROM tm_routes
 ORDER BY distance
 LIMIT 10;

This query returns the following results when run against the postgis table I created from the trimet shapefile projected to WGS84. Note that postgres requires us to double-quote the column names because they are capitalized.

 route |      description       |           direction            |      distance
-------+------------------------+--------------------------------+---------------------
     9 | Powell/Broadway        | To Powell & 98th or Gresham TC | 0.00200889736828729
     9 | Powell/Broadway        | To Saratoga & 27th             | 0.00200890251737265
    14 | Hawthorne              | To Foster & 94th               | 0.00229043846550735
    14 | Hawthorne              | To Portland City Center        | 0.00229044244641688
    71 | 60th Ave/122nd Ave     | To Foster & 94th               | 0.00460586966268184
    71 | 60th Ave/122nd Ave     | To Clackamas Town Center       | 0.00460588437600594
     4 | Division/Fessenden     | To Gresham TC                  | 0.00575370636240205
     4 | Division/Fessenden     | To St Johns                    | 0.00575370636240205
    75 | Cesar E Chavez/Lombard | To St. Johns                   | 0.00898150226029743
    75 | Cesar E Chavez/Lombard | To Milwaukie                   | 0.00898150543661629
(10 rows)


grouping results in sql

The routes returned look great for my current location. The data is accurate enough to differentiate between routes that run on the same street in different directions. If I were waiting at a bus stop, it would be a safe bet to assume that the bus I want is on the same side of the street as me. I still want to return other routes, but at this point I don’t really care what direction they are going. I want to group the results and just return a single row for each route. We can achieve that goal by tweaking the sql.

SELECT "RTE" AS route, "RTE_DESC" AS description,
       min(distance(PointFromText('POINT(-122.613639 45.499541)', 4326), the_geom)) AS distance
  FROM tm_routes
 GROUP BY route, description
 ORDER BY distance
 LIMIT 10;

 route |      description       |      distance
-------+------------------------+---------------------
     9 | Powell/Broadway        | 0.00200889736828729
    14 | Hawthorne              | 0.00229043846550735
    71 | 60th Ave/122nd Ave     | 0.00460586966268184
     4 | Division/Fessenden     | 0.00575370636240205
    75 | Cesar E Chavez/Lombard | 0.00898150226029743
    66 | Marquam Hill/Hollywood | 0.00898150861891694
    17 | Holgate/NW 21st        | 0.00924074516961275
    10 | Harold St              |  0.0148568363291083
    15 | Belmont/NW 23rd        |  0.0170180326549672
    19 | Woodstock/Glisan       |   0.020370095002805
(10 rows)

putting it all together

I’m happy with these results. The next step is to wrap the query in a method and expose it on our backend. Then we just need to write a new Sencha Touch view that requests the data based on the coordinate returned by the geolocation API in the browser. Should be a snap.
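
On the database side, one way to do the wrapping is a small SQL function. This is only a sketch: the name nearest_routes is my own invention, the ::text casts are there only because I don't know the exact column types of the shapefile import, and the backend could just as easily keep the query inline.

CREATE OR REPLACE FUNCTION nearest_routes(lon double precision, lat double precision)
RETURNS TABLE (route text, description text, distance double precision) AS $$
    -- same query as above, parameterized on the caller's WGS84 coordinate
    SELECT "RTE"::text, "RTE_DESC"::text,
           min(distance(PointFromText('POINT(' || $1 || ' ' || $2 || ')', 4326), the_geom))
      FROM tm_routes
     GROUP BY "RTE", "RTE_DESC"
     ORDER BY 3
     LIMIT 10;
$$ LANGUAGE sql STABLE;

-- the backend then just runs:
SELECT * FROM nearest_routes(-122.613639, 45.499541);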

Leo Hsu and Regina Obe: Table Inheritance and the tableoid


If I had to name the number one feature I love most about PostgreSQL, it would be table inheritance, which we described in How to Inherit and Uninherit. A lot of people use it for table partitioning with CONSTRAINT EXCLUSION. Beyond that, in combination with the PostgreSQL schema search_path (customizable by user and/or database), it makes for a very flexible abstraction tool. For example, many of our web apps service multiple departments where each department/client wants to keep a high level of autonomy, so we set aside a schema for each that inherits from a master template schema. Each department site uses a different set of accounts, with the primary schema being that of the department/client, so each is hitting its own tables.

Inheritance allows us to keep data separate, do roll-up reports when we need to, and use the same application front-end, while still letting us add new columns in just one place (the master template schema). It is more flexible than other approaches: for example, a city organization may need to share some tables, such as a system-loaded list of funding sources used across the agency. We can set aside these shared tables in a separate schema visible to all, or let some departments have their own copy that they can change if they don't want to use the shared one.
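
A minimal sketch of that layout, with entirely made-up schema, table, and role names, might look like this:

-- the master template schema holds the canonical definitions
CREATE SCHEMA template;
CREATE TABLE template.cases (
    case_id   serial PRIMARY KEY,
    title     text,
    opened_on date DEFAULT current_date
);

-- each department gets its own schema whose tables inherit from the template;
-- adding a column to template.cases later propagates to every child
CREATE SCHEMA parks;
CREATE TABLE parks.cases () INHERITS (template.cases);

CREATE SCHEMA water;
CREATE TABLE water.cases () INHERITS (template.cases);

-- a department account sees its own tables first, while a roll-up report
-- can query template.cases and see every department's rows
ALTER ROLE parks_app SET search_path = parks, public;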

Every once in a while we find ourselves needing to query the whole hierarchy and to know which table each result row is coming from. To solve that, we use the system column tableoid, which all user tables have: it is the object id of the table the row lives in. PostgreSQL has several system columns that you have to select explicitly and that are not returned by SELECT *; tableoid is one of them. They are tableoid, cmin, cmax, xmin, xmax, and ctid, all described in System Columns. The PostgreSQL docs on inheritance have examples of using it, but we thought it worthwhile to repeat the exercise, since it's not common knowledge and is a unique enough feature of PostgreSQL that people coming from other relational databases may miss the treat. I've often demonstrated it to non-PostgreSQL users, for example SQL Server or MySQL users, and they practically fall out of their chairs when I show them the feature and its endless possibilities.
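
Reusing the made-up schemas from the sketch above, the trick looks like this; casting tableoid to regclass turns the bare oid into a readable table name.

-- which child table does each row of the hierarchy actually live in?
SELECT tableoid::regclass AS source_table, *
  FROM template.cases;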


Continue reading "Table Inheritance and the tableoid"

Andrew Dunstan: Under the wire

On Wednesday, four days before the start of the final commitfest for PostgreSQL release 9.2, Robert Haas published yet another patch to include JSON as a core type. Basically, his patch just parses the text to make sure it is valid JSON, and stores it as text. I'd just about given up on getting this into release 9.2, but I thought his patch was a bit too minimal, so I put on my running shoes and added some functionality to produce JSON from the database: query_to_json(), array_to_json() and record_to_json(). Yesterday I put out this patch, extending his, right on the deadline for this commitfest and thus for this release. A few simple examples from the regression tests:
SELECT query_to_json('select x as b, x * 2 as c from generate_series(1,3) x',false);
                query_to_json                
---------------------------------------------
 [{"b":1,"c":2},{"b":2,"c":4},{"b":3,"c":6}]
(1 row)

SELECT array_to_json('{{1,5},{99,100}}'::int[]);
  array_to_json   
------------------
 [[1,5],[99,100]]
(1 row)

-- row_to_json
SELECT row_to_json(row(1,'foo'));
     row_to_json     
---------------------
 {"f1":1,"f2":"foo"}
(1 row)
and a slightly less simple example:
SELECT row_to_json(q) 
FROM (SELECT $$a$$ || x AS b, 
         y AS c, 
         ARRAY[ROW(x.*,ARRAY[1,2,3]),
               ROW(y.*,ARRAY[4,5,6])] AS z 
      FROM generate_series(1,2) x, 
           generate_series(4,5) y) q;
                            row_to_json                             
--------------------------------------------------------------------
 {"b":"a1","c":4,"z":[{"f1":1,"f2":[1,2,3]},{"f1":4,"f2":[4,5,6]}]}
 {"b":"a1","c":5,"z":[{"f1":1,"f2":[1,2,3]},{"f1":5,"f2":[4,5,6]}]}
 {"b":"a2","c":4,"z":[{"f1":2,"f2":[1,2,3]},{"f1":4,"f2":[4,5,6]}]}
 {"b":"a2","c":5,"z":[{"f1":2,"f2":[1,2,3]},{"f1":5,"f2":[4,5,6]}]}
(4 rows)


We need to sort out a few encoding issues, but I'm now fairly hopeful that we'll have some very useful and fast JSON functionality in release 9.2.

Josh Berkus: SFPUG Reviewfest 2012

The San Francisco PostgreSQL Users Group held its second-ever reviewfest to help with the final CommitFest of PostgreSQL 9.2. In short, several patches got reviewed, and a number of folks who hadn't reviewed patches before learned how.

Bruce Momjian: Coming to Boston

Bruce Momjian: Presentations Updated


As part of my server upgrade, I migrated to a newer version of LyX and LaTeX and, most significantly, to a more modern and powerful LaTeX document class, Beamer. All 1300 slides in my presentations have been updated. If you see something that needs improvement, no matter how minor, please let me know via email, chat, or blog comment.

Andrew Dunstan: I didn't say that

Somebody has linked to my earlier blog post with a link title saying the JSON type would definitely be in release 9.2 of PostgreSQL. I haven't said that. I said I hoped it would be. It hasn't been committed yet, and we still have some issues to sort out. People need to be more careful about this sort of thing.

Bruce Momjian: TOAST-y Goodness


Many things are better toasted: cheese sandwiches, nuts, marshmallows. Even some of your Postgres data is better toasted — let me explain.

Postgres typically uses an eight-kilobyte block size — you can verify this by running pg_controldata.
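
As a quick aside, the same compiled-in value can also be checked from a live session via the block_size setting; this is just an alternative to pg_controldata, not a substitute for it.

SHOW block_size;

-- or, equivalently
SELECT current_setting('block_size');  -- 8192 on a default build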

Continue Reading »
