Channel: Planet PostgreSQL

Andrew Dunstan: Another transformation

A few people have mentioned an hstore to JSON transformation to me. I don't think it's possible to get a perfect fit, but something like this gets fairly close, and will work for many cases:
create or replace function hstore_to_json(h hstore) returns text language sql
as $f$

   select '{' || array_to_string(array_agg(
          '"' || regexp_replace(key,E'[\\"]',E'\\\&','g') || '":' ||
          case
            when value is null then 'null'
            when value ~ '^true|false|(-?(0|[1-9]\d*)(\.\d+)?([eE][+-]?\d+)?)$' then value
            else '"' || regexp_replace(value,E'[\\"]',E'\\\&','g') || '"'
          end 
       ),',') || '}'
   from each($1)

$f$;
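
A quick way to sanity-check the function above (the sample hstore literal is just illustrative):

select hstore_to_json('a=>1, b=>"some text", c=>NULL, d=>true'::hstore);
-- yields (key order may vary): {"a":1,"b":"some text","c":null,"d":true}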

Andrew Dunstan: Tree climbing

We don't provide any special indexing for XML on PostgreSQL - in fact there are no comparison operators defined for the type at all - and the current JSON patch won't provide anything special there either. But I have been wondering exactly what sort of indexing might be useful for tree structured objects. For the most part I'm inclined to think that these should be treated as singleton objects where we don't need to search on them (and if we do then the database is probably very badly structured). At least that's how I use them. For example, I can imagine storing a web session object as XML or JSON. But I'm going to know the session ID and store that as the key of the session table. I should never need to search for a session by its content, only by its ID. But let's say we did need to. What sort of operators would we use to index the data?
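
The keyed-by-ID pattern described above might look like this (a minimal sketch; the table and column names are illustrative):

create table session (
    session_id text primary key, -- always looked up by its key
    data       xml               -- or json: an opaque singleton object, never searched by content
);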

John DeSoi: pgEdit on GitHub

Keith: PG Extractor - Got Git


I've finally gotten Git support added into pg_extractor. It works pretty much the same way the SVN option already did. One important difference is that there are two options for committing:

--git

This just does a local commit to a locally maintained repository

--gitpush

This does a local commit as well as push to an already configured remote repository

You use either one option or the other, not both. The Git options also expect a proper .gitconfig file to be set up in your environment for the user running pg_extractor. There is no option for passing the git username like SVN has (and I don't see a need for one). Remote repositories will also have to be configured before using the push option.

An important thing to note about using the svn or git options with pg_extractor is that it does not do any initial VCS setup on the folders it creates and outputs to. It's best to run it first without any VCS options to get an initial dump and perform a manual commit (and/or push with git). Then, for any future runs of pg_extractor, use the VCS options to track changes.

As always, please report any bugs or issues!


Bruce Momjian: TOAST Queries


As a followup to my previous blog entry, I want to show queries that allow users to analyze TOAST tables. First, we find the TOAST details about the test heap table:

SELECT oid, relname, reltoastrelid, reltoastidxid FROM pg_class where relname = 'test';
  oid  | relname | reltoastrelid | reltoastidxid
-------+---------+---------------+---------------
 17172 | test    |         17175 |             0
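
One way to dig further (a sketch, not part of the original post, run as a superuser) is to look up the TOAST table named by reltoastrelid and inspect its chunks; TOAST tables are named after the owning table's OID, so here it is pg_toast.pg_toast_17172:

SELECT relname FROM pg_class WHERE oid = 17175;
SELECT chunk_id, chunk_seq, length(chunk_data) AS chunk_bytes
FROM pg_toast.pg_toast_17172
ORDER BY chunk_id, chunk_seq
LIMIT 5;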

Continue Reading »

Bruce Momjian: New Server


A few weeks ago, I finally replaced my eight-year-old home server. The age of my server and its operating system (BSD/OS, last officially updated in 2002) were a frequent source of amusement among Postgres community members. The new server is:

  • Super Micro 7046A-T 4U Tower Workstation
  • 2 x Intel Xeon E5620 2.4GHz Quad-Core Processors
  • Crucial 24GB Dual-Rank PC3-10600 DDR3 SDRAM
  • Intel 160GB 320 Series SSD Drive
  • 4 x Western Digital Caviar Green 2TB Hard Drives

Continue Reading »

Mark Wong: January Meeting Recap


11 people showed up for our first meeting of the new year.  Thanks to Iovation for providing a comfortable space with pizza.

We will be having another PRP soon, as well as a YAMS hackathon.  We may combine the two into one event.  Watch this space for details.

Here are Tim’s slides from his Database Trending talk last night.  I can’t wait to try this at home!

Database Trending

I converted his .odp slides to .ppt so’s I could a) upload them to WP (no .odp allowed!) and b) include his notes.


Marc Balmer: Get Database Security Right

Most open source applications that use a PostgreSQL database follow the same scheme for authorization: a single role is created in the database, which owns all objects in the database and has full access to them. As soon as the application is launched, it connects to the database with this role, using a password that is stored somewhere, e.g. in a config file.
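
The pattern being criticized looks roughly like this (a minimal sketch; the role and database names are illustrative):

CREATE ROLE myapp LOGIN PASSWORD 'password-from-config-file';
CREATE DATABASE myapp_db OWNER myapp;
-- every application process connects as myapp, which can read, modify or drop any object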

From a user's perspective, this is nice: the user does not even need to know there is a database under the hood. From a security perspective, this is a nightmare, mostly for two reasons.

Continue reading "Get Database Security Right"

Hubert 'depesz' Lubaczewski: Waiting for 9.2 – NULLS from pg_*_size() functions

On 19th of January, Heikki Linnakangas committed patch: Make pg_relation_size() and friends return NULL if the object doesn't exist.   That avoids errors when the functions are used in queries like "SELECT pg_relation_size(oid) FROM pg_class", and a table is dropped concurrently.   Phil Sorber This patch on its own is not very visible, but it [...]
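
A minimal illustration of the new behavior (the query pattern is the one mentioned in the commit message):

SELECT relname, pg_relation_size(oid)
FROM pg_class
ORDER BY 2 DESC NULLS LAST;
-- a relation dropped while the query runs now yields NULL instead of an error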

Selena Deckelmann: I’m keynoting today at SCALE10x


Slides (as of this moment) are here: Mistakes were made. I changed quite a bit of the beginning and end, given how big the audience is. In previous talks, we’ve usually ended with a fun “omg, here’s the craziest story I know” session. I imagine we’ll get a little bit of that today.

Postgres folks will note a relevant picture on slide 13. :)

This is my first keynote! Thanks so much to SCALE for inviting me. There were at least 1500 registered attendees as of Friday, so I’m looking forward to a big crowd.

Valentine Gogichashvili: Schema based versioning and deployment for PostgreSQL

I am one of the supporters of keeping as much business logic as possible in the database itself. This reduces the access layer of the application to mostly dumb transport and data transformation logic that can be implemented using different technologies and frameworks, without the need to re-implement critical data consistency and data distribution logic in several places. It gives you an easy way to control what you are doing with your data and how your applications are allowed to access that data, and even to exchange the underlying data structures transparently from the upper levels of the code and without downtime. It also gives the possibility to add an additional layer of security by allowing access to the data only through stored procedures, which can change their security execution context as needed (the SECURITY DEFINER feature of PostgreSQL).

This approach has some disadvantages, of course. One of the biggest technical problems, which very easily becomes an organizational problem if you have a relatively big team of developers, is how to rapidly roll out new features without touching old, functioning stored procedures, so that old versions of your upper-level applications can still access the previous versions of the stored procedures, while newly rolled-out nodes with the new software stack on them access new stored procedures that do something more, or less, or return different data sets compared to their previous versions. And with hundreds of stored procedures there to access and manipulate data, any attempt to keep all new versions of them backwards compatible becomes a nightmare.

The classical way to do this would be to keep all the changes backwards compatible and, if that is not possible, to create a new version of a stored procedure with some version suffix like _v2, mark the previous version as deprecated, and drop it after all of your software stack is rolled out to use the new function. But if you are rolling out a new version of the whole stack once or twice a week, keeping track of what is used and what is not becomes quite a challenge... and the discipline of all the developers has to be really good as well. Stored procedures are not the only objects that change: their return and input types can change together with them. Changing a return type that is used by more than two stored procedures in a backwards-compatible fashion is pure horror if you want to do it without creating a new version of that type and new versions of all the stored procedures that use it. Dependency control becomes another problem.

My solution to that problem was to introduce schema-based versioning of PostgreSQL stored procedures. It relies on PostgreSQL schemas and the search_path setting of a session.

All the stored procedures that are exposed to the client software stack are grouped in one API schema that contains only stored procedures and the types they need.

The schema name contains a version, like proj_api_vX_Y_Z, where X_Y_Z is the version that a software stack is targeted at. The software stack does SET search_path TO proj_api_vX_Y_Z, public; immediately after it gets a connection from the pool, and all calls to the stored procedures are made without explicitly specifying a schema name, so PostgreSQL finds the needed stored procedure in the specified schema.
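
In practice the call pattern looks like this (a small sketch; the concrete version number and function signature are illustrative):

SET search_path TO proj_api_v1_2_3, public;
SELECT * FROM get_object(42);  -- resolved to proj_api_v1_2_3.get_object via search_path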

So when a branch is stable and the branch version is fixed, that version is used when setting the default search_path for the software being deployed from that branch. For example, in Java using the BoneCP JDBC pool, this means setting the initSQL property of all the pools used to access the proj database.

We are storing the sources of all the stored procedures (and other database objects) in a special database directory structure that is checked into a usual SCM system. All the files are sorted into corresponding folders and prefixed with a two-digit numeric prefix to ensure the sort order (good old BASIC times :) ). Like:
50_proj_api
  00_create_schema.sql
  20_types
    20_simple_object_input_type.sql
  30_stored_procedures
    20_get_object.sql
    20_set_object.sql

Here the 00_create_schema.sql file contains the CREATE SCHEMA proj_api; statement, statements to set the default security options for newly created stored procedures, and a SET search_path TO proj_api, public; statement that ensures that all the objects coming after that file are injected into the correct API schema. An example 00_create_schema.sql file can look like:
RESET role;

CREATE SCHEMA proj_api AUTHORIZATION proj_api_owner;

ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA proj_api REVOKE EXECUTE ON FUNCTIONS FROM public;

GRANT USAGE ON SCHEMA proj_api TO proj_api_usage;

ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA proj_api GRANT EXECUTE ON FUNCTIONS TO proj_api_executor;
ALTER DEFAULT PRIVILEGES FOR ROLE proj_api_owner IN SCHEMA proj_api GRANT EXECUTE ON FUNCTIONS TO proj_api_executor;
ALTER DEFAULT PRIVILEGES IN SCHEMA proj_api GRANT EXECUTE ON FUNCTIONS TO proj_api_executor;

SET search_path to proj_api, public;

DO $SQL$
BEGIN
    IF CURRENT_DATABASE() ~ '^(prod|staging|integration)_proj_db$' THEN
        -- change default search_path for production, staging and integration databases
        EXECUTE 'ALTER DATABASE ' || CURRENT_DATABASE() || ' SET search_path to proj_api, public;';
    END IF;
END
$SQL$;

SET role TO proj_api_owner;

This kind of layout makes it possible to bootstrap the API schema objects into a needed database easily and, very importantly, to keep track of all the database logic changes in the SCM system, which lets you review and compare the changes between releases.

Bootstrapping into a development database can be done with a very simple script like:
(
echo 'DROP SCHEMA proj_api CASCADE;'
find 50_proj -type f -name '*.sql' \
| sort \
| xargs cat \
) | psql dev_proj_db -1 -f -
In the case of a development database, we actually bootstrap all the objects, including tables, into a freshly prepared database instance, so that integration tests can run and modify data as they want.

Injecting into a production or staging database can be automated and combined with different kinds of additional checks, but in the end it is something like:
(
cat 50_proj/00_create_schema.sql | sed 's/proj_api/proj_api_vX_Y_Z/g'
find 50_proj -type f -name '*.sql' ! -name '00_create_schema.sql' \
| sort \
| xargs cat \
) | psql prod_proj_db -1 -f -
So after that, we have a fresh copy of the whole shiny API schema with all the dependencies rolled out to the production database. And these schema objects are accessed only by the software that is supposed to do so, the software that has been tested to run with this very combination of versions of the stored procedures and the types they depend on. And if we see any problems with the rollout, we can just roll back the software stack so that it can still access our old stored procedures, located in the schema with the previous version of our API.

This method does not solve the problem of versioning the tables in our data schema (we keep all the tables, related objects and low-level transformation stored procedures in a proj_data schema), but for that there is a very simple, but very nice, solution suggested and implemented by Depesz: http://www.depesz.com/index.php/2010/08/22/versioning/. Of course, changes in table structure should still be kept backwards compatible, and nicely written database diff rollout and rollback files should be written for every such change.
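
Such a pair of diff files could be as small as this (a sketch; the table and column names are illustrative):

-- rollout: add_description.sql
ALTER TABLE proj_data.some_object ADD COLUMN description text;

-- rollback: add_description_rollback.sql
ALTER TABLE proj_data.some_object DROP COLUMN description;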

I am not going into details about how to prepare the Spring configuration of the JDBC pools for the Java clients, or how to configure the bootstrapping for integration testing in your Maven project configuration, as this information would not add any real value to this blog post, which has already become much longer than I expected at the beginning.

NOTE: Because of a bug in the PostgreSQL JDBC driver, the types that are used as input parameters for stored procedures cannot be located in different schemas (type OIDs are looked up by name only, without consideration of the schema or search_path). Patching the driver is very easy, and we did so in my company to be able to use schema-based versioning in our Java projects. I have reported the bug twice already (http://archives.postgresql.org/pgsql-jdbc/2011-03/msg00007.php, http://archives.postgresql.org/pgsql-jdbc/2011-12/msg00083.php), but unfortunately got no response from anybody. I probably have to submit a patch myself sometime.

Hubert 'depesz' Lubaczewski: Waiting for 9.2 – split of current_query in pg_stat_activity

On 19th of January, Magnus Hagander committed patch: Separate state from query string in pg_stat_activity   This separates the state (running/idle/idleintransaction etc) into it's own field ("state"), and leaves the query field containing just query text.   The query text will now mean "current query" when a query is running and "last query" in other [...]
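
In other words, on 9.2 you can now write something like this (a minimal illustration):

SELECT pid, state, query FROM pg_stat_activity;
-- state is e.g. 'active', 'idle' or 'idle in transaction';
-- query holds the running query, or the last query when the backend is idle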

Hubert 'depesz' Lubaczewski: Some new tools for PostgreSQL or around PostgreSQL

During the last months I wrote some tools to help me with my daily duties. I’d like to let you know about them, as you might find them useful. So, here we go: pg.logs.tail Available from OmniTI SVN. It’s a smarter “tail -f” for PostgreSQL logs. Smarter in the sense that it knows that PostgreSQL [...]

Andrew Dunstan: Using PLV8 to index JSON

People seem to be getting very excited about JSON in 9.2. So I just tried using the new type in combination with PLV8. It seems to work pretty well. Let's say we want to index a field in our JSON object called "x", so when we query on it the query will run fast. No problem. Here's a very simple function that gets the member out of the JSON:
CREATE or replace FUNCTION jmember (j json, key text )
 RETURNS text
 LANGUAGE plv8 
 IMMUTABLE
AS $function$

  var ej = JSON.parse(j);
  if (typeof ej != 'object')
        return null;
  return JSON.stringify(ej[key]);

$function$;
In reality we'd want something a bit more sophisticated than this, but you get the idea. Armed with this function we can now create our index, using the functional index feature of PostgreSQL:
CREATE INDEX x_in_json ON mytable (jmember(jsonfield,'x'));
Now, when we issue a query like
SELECT *
FROM mytable
WHERE jmember(jsonfield,'x') = '"foo"';
It should be able to use the index. This is reasonably analogous to a very simple use of MongoDB's ensureIndex() function.

We could make this somewhat nicer by providing some operators, and maybe building in a function like this, but the fundamental idea should work pretty much the same.
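
For instance, one could wrap jmember in an operator (a sketch of that idea; nothing like this ships with 9.2, and the operator name is just illustrative). Note that jmember returns the JSON-encoded text, so a string value compares against '"foo"':

CREATE OPERATOR -> (
    LEFTARG = json,
    RIGHTARG = text,
    PROCEDURE = jmember
);

CREATE INDEX x_in_json2 ON mytable ((jsonfield -> 'x'));
SELECT * FROM mytable WHERE jsonfield -> 'x' = '"foo"';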

Bruce Momjian: More Lessons From My Server Migration


The new server is 2-10 times faster than my old 2003 server, but that 10x speedup is only possible for applications that:

  • Do lots of random I/O, thanks to the SSDs. Postgres already supports tablespace-specific random_page_cost settings, but it would be interesting to see if there are cases that can be optimized for low random page costs (see the sketch after this list). This is probably not an immediate requirement because the in-memory algorithms already assume a low random page cost.
  • Can be highly parallelized. See my previous blog entry regarding parallelism. The 16 virtual cores in this server certainly offer more parallelism opportunities than my old two-core system.
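
The tablespace-specific setting mentioned in the first item looks like this (a sketch; the tablespace name is illustrative):

ALTER TABLESPACE ssd_space SET (random_page_cost = 1.1);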

Other observations:

  • It takes serious money to do the job right, roughly USD $4k — hopefully increased productivity and reliability will pay back this investment.
  • I actually started the upgrade two years ago by adjusting my scripts to be more portable; this made the migration go much more smoothly. The same method can be used for migrations to Postgres by rewriting SQL queries to be more portable before the migration. Reliable hardware is often the best way to ensure Postgres reliability.
  • My hot-swappable SATA-2 drive bays allow for a flexible hard-drive-based backup solution (no more magnetic tapes). File system snapshots allow similar backups for Postgres tablespaces, but it would be good if this were more flexible. It would also be cool if you could move a drive containing Postgres tablespaces from one server to another (perhaps after freezing the rows).

Continue Reading »


Andrew Dunstan: Setting up PLV8 on Fedora 16

Fedora 16 ships with v8 (I'm not sure how far back it goes; Fedora 15 at least), which makes installing PLV8 extremely easy. Here's what I did earlier today. It took about a minute. I already had an installed and running instance of Postgres where I wanted PLV8 installed. So I did this:
cd inst.json
sudo yum install v8 v8-devel
hg clone https://code.google.com/p/plv8js/
cd plv8js
PATH=../bin:$PATH make USE_PGXS=1
PATH=../bin:$PATH make USE_PGXS=1 install
cd ..
bin/createdb testplv8
bin/psql -c 'create extension plv8; create language plv8;' testplv8

Pretty simple, very quick.
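
If you want a quick smoke test that the language really works (a small sketch, not from the original post):

create or replace function plv8_test() returns text language plv8 as $$ return 'hello from v8'; $$;
select plv8_test();  -- hello from v8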

Bruce Momjian: The Most Important Postgres CPU Instruction


Postgres consists of roughly 1.1 million lines of C code, which is compiled into an executable with millions of CPU instructions. Of the many CPU machine-language instructions in the Postgres server executable, which one is the most important? That might seem like an odd question, and one that is hard to answer, but I think I know the answer.

You might wonder, "If Postgres is written in C, how would we find the most important machine-language instruction?" Well, there is a trick to that. Postgres is not completely written in C. There is a very small file (1000 lines) of C code that adds specific assembly-language CPU instructions into the executable. This file is called s_lock.h. It is an include file, referenced in various parts of the server code, that provides very fast locking operations. The C language doesn't supply fast-locking infrastructure, so Postgres is required to supply its own locking instructions for all twelve supported CPU architectures. (Operating system kernels do supply locking instructions, but they are much too slow to be used for Postgres.)

Continue Reading »

Christophe Pettus: PostgreSQL Performance When It’s Not Your Job

Pavel Golub: Joomla! 2.5 with PostgreSQL support officially released



Joomla, one of the world’s most popular open source content management systems (CMS) used for everything from websites to blogs to Intranets, today announces the immediate availability of Joomla 2.5. Along with new features such as advanced search and automatic notification of Joomla core and extension updates, the Joomla CMS for the first time includes multi-database support with the addition of Microsoft SQL Server. Previous versions of Joomla were compatible exclusively with MySQL databases.

Way to go, Joomla! But why don’t you guys mention the PostgreSQL database in the main release story? Do you really think that MSSQL is a more common choice for the database layer of a CMS? Seriously?


Filed under: Announces, PostgreSQL Tagged: CMS, Joomla!, microsoft sql server, PostgreSQL, release

Bruce Momjian: Increasing Database Reliability


While database software can be the cause of outages, for Postgres, it is often not the software but the hardware that causes failures — and storage is often the failing component. Magnetic disk is one of the few moving parts on a computer, and hence prone to breakage, and solid-state drives (SSDs) have a finite write limit.

While waiting for storage to start making loud noises or fail outright is an option, a better option is to use some type of monitoring that warns of storage failure before it occurs: enter SMART. SMART is a system developed by storage vendors that allows the operating system to query diagnostics on the drive and warn of unusual storage behavior before failure occurs. While read/write failures are reported by the kernel, SMART parameters often warn of danger earlier. Below is the SMART output from a Western Digital (WDC) WD20EARX magnetic disk drive:

Continue Reading »
