Channel: Planet PostgreSQL

Jeff Frost: pg_dump compression settings

After doing the base backup benchmarks, I thought it would be interesting to benchmark pg_dump run locally and remotely using all the different compression settings.  This will allow me to compare the dump times and the space savings as well as see how much dumping remotely slows the process.  Nobody is storing their dumps on the same server, right?

For the record, all these dumps used pg_dump's custom dump format (-Fc); the PGDATA directory is 93GB on disk and resides on a 4-disk RAID10 of 7200 RPM SATA drives.

CPU on the test server: Intel(R) Xeon(R) CPU X3330  @ 2.66GHz.
CPU on the test remote client: Intel(R) Core(TM) i7 CPU 960  @ 3.20GHz.
Network is gigabit.
The total size reported is from du -sh output.
Both client and test server are running postgresql-9.1.2.

For comparison, our best base backup time from the base backup blog post was: 15m52.221s

The results:

Local pg_dump:

 compression |    time     | total size
-------------+-------------+------------
           0 | 19m7.455s   | 77G
           1 | 21m53.128s  | 11G
           2 | 22m27.507s  | 11G
           3 | 24m18.966s  | 9.8G
           4 | 30m10.815s  | 9.2G
           5 | 34m26.119s  | 8.3G
           6 | 41m35.340s  | 8.0G
           7 | 49m4.484s   | 7.9G
           8 | 91m28.689s  | 7.8G
           9 | 103m24.883s | 7.8G

Remote pg_dump:

 compression |    time     | total size
-------------+-------------+------------
           0 | 20m1.363s   | 77G
           1 | 22m9.205s   | 11G
           2 | 22m19.158s  | 11G
           3 | 23m7.426s   | 9.8G
           4 | 26m10.383s  | 9.2G
           5 | 28m57.431s  | 8.3G
           6 | 33m23.939s  | 8.0G
           7 | 38m11.321s  | 7.9G
           8 | 62m27.385s  | 7.8G
           9 | 72m5.123s   | 7.8G

So, a few interesting observations:

  • Base backups are indeed faster, but that's no surprise since they have less overhead. 
  • Taking the backup locally is only slightly faster than taking it remotely at the smaller compression levels.
  • At compression level 3 and above, the faster CPU on the test client becomes a huge benefit.
  • Looks like the default compression is level 6. This might not be what you want if you're concerned about minimizing the backup time and have the disk space to spare.
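
For reference, the compression level of the custom format is selected with pg_dump's -Z option; level 6 is what you get when -Z is omitted. A sketch of the kind of invocation benchmarked above, with hypothetical host and database names:

pg_dump -Fc -Z 3 -h dbserver -U postgres -f mydb.c3.dump mydb

Whatever the level, a custom-format dump is restored with pg_restore, which can also parallelize the restore with -j.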

Andrew Dunstan: Making pg_get_viewdef usable

For years some people (including me) have been annoyed by the fact that pg_get_viewdef() runs all the fields together more or less on one line, even in pretty printing mode. It makes the output for large views pretty unusable by humans. So I've been working on a fix for it. My preferred solution would be to put each field on its own line (or lines in a few cases), but some people felt this would use too much vertical space. So I've been working on a solution that's a bit more flexible than that, with the default pretty printing mode wrapping at 80 characters per line. Here's a screen shot that demonstrates the effects. I still like the last one there best, though.
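
For comparison, the pretty-printing mode he mentions is the two-argument form of the function (hypothetical view name):

SELECT pg_get_viewdef('my_big_view'::regclass, true);

The work described here teaches that pretty mode to wrap the column list at a target width (80 characters by default) instead of running everything together on one line.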

Jared Watkins: What I do – Dynamic Daily Table Partitions With Postgres


As part of a new and fairly large project I have a need to partition a few postgres tables and have a rolling daily window.  That is.. I want to organize data by a timestamp storing each day in its own partition and maintain 90 days of historical data.  Doing this is possible in Postgres but it’s not pretty or very clean to set it up.  To simplify the process I wrote this perl script that (when run daily) will pre-create a certain number of empty partitions into the future and remove the oldest partitions from your window.

The script is generalized so as to be easy to modify and there isn’t much here that’s specific to postgres.. so it could easily be adapted for use with other systems like Oracle. You will need to put in the DDL for the sub tables you will create but otherwise it’s pretty straight forward.  Please let me know if you find this useful as I couldn’t find anything else out there like it.

pgDynamicPartitions.pl

Here's how you call it:

This script implements a rolling window of _daily_ partitions in a postgres
database. This means.. you partition a table on some timestamp/date column
and maintain a fixed number of days into the past and pre-create empty
partitions x days into the future. It's intended to be called daily to
drop the oldest partition and create a new one for future use. However,
by calling it on a different schedule and with larger windows it could be
used for weekly or monthly schedules etc. 
 
To use this you will need to customize the routine that creates your child
table(s) along with their associated indexes and other parameters. The
script is made to be generalized.. so that you may pre-define several child
table templates and then call this script from cron with different
window periods for each table. 
 
Syntax
 
pgDynamicPartitions.pl	[-h dbhost] [-d database] [-t table] [-p past] [-f future]
 
    -h [host]           DB Host to operate on
    -d [database]	Database to connect to
    -t [table]          Parent table name to partition
    -p [days]           How many daily partitions to keep into the past
                        Ones older than this will be dropped.
    -f [days]           How many empty daily partitions to be created into the future.
    -n 			Don't make changes just show what would be done.
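
The child-table DDL the script expects you to fill in follows the usual PostgreSQL inheritance pattern for partitioning. A minimal sketch of one day's partition, with hypothetical table and column names:

CREATE TABLE log_20120117 (
    CHECK (logged_at >= DATE '2012-01-17' AND logged_at < DATE '2012-01-18')
) INHERITS (log);
CREATE INDEX log_20120117_logged_at_idx ON log_20120117 (logged_at);

Dropping the oldest partition that falls outside the window is then just a DROP TABLE of that child table.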

Leo Hsu and Regina Obe: Mail Merging using Hstore


For those who aren't familiar with hstore, it's a key/value storage type that is packaged as an extension or contrib module in PostgreSQL 8.2+. In PostgreSQL 9.0 it got a little extra loving in several ways, one of which was the introduction of the hstore(record) casting function that converts a record to an hstore. In this article, I'll demonstrate how you can use this new casting function to do very sleek mail merges right in the database. The only caveat is that it seems to name the keys correctly only if it is fed a real table or view; derived queries such as aggregates get keys named f1, f2, etc.

If you are on PostgreSQL 9.1 or above, installing hstore is just a CREATE EXTENSION hstore; command away. If you are on a lower version of PostgreSQL, you can usually find hstore.sql in share/contrib.
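
As a minimal sketch of the idea (hypothetical customers table; the full article builds a richer template), each row is cast to hstore and its keys fill the placeholders:

SELECT replace(replace('Dear %first_name% %last_name%,',
           '%first_name%', hstore(c) -> 'first_name'),
           '%last_name%', hstore(c) -> 'last_name')
  FROM customers AS c;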


Continue reading "Mail Merging using Hstore"

Pavel Golub: Determination of a leap year in PostgreSQL


From Wikipedia:

A leap year (or intercalary or bissextile year) is a year containing one extra day (or, in the case of lunisolar calendars, a month) in order to keep the calendar year synchronized with the astronomical or seasonal year. Because seasons and astronomical events do not repeat in a whole number of days, a calendar that had the same number of days in each year would, over time, drift with respect to the event it was supposed to track. By occasionally inserting (or intercalating) an additional day or month into the year, the drift can be corrected. A year that is not a leap year is called a common year.

So, to determine whether a year is a leap year in the Gregorian calendar, we need to check the following condition:

(Year mod 4 = 0) AND ((Year mod 100 != 0) or (Year mod 400 = 0)),
where mod is modulo operation

This dirty query does the trick then:

SELECT c.y, (c.y % 4 = 0) AND ((c.y % 100 <> 0) OR (c.y % 400 = 0))
 FROM generate_series(1582, 2020) AS c(y)


I prefer to have a function for this task. At first I thought it had to be written in PL/pgSQL to be fast enough. Yeah, I'm too lazy for C functions. However, after testing I saw that the plain SQL function performs just as well. I don't know why. Here are the sources for both of them:

CREATE OR REPLACE FUNCTION isleapyear(year integer)
 RETURNS boolean AS
'SELECT ($1 % 4 = 0) AND (($1 % 100 <> 0) or ($1 % 400 = 0))'
 LANGUAGE sql IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION isleapyear(year integer)
 RETURNS boolean AS
'BEGIN
  RETURN (Year % 4 = 0) AND ((Year % 100 <> 0) or (Year % 400 = 0));
END'

 LANGUAGE plpgsql IMMUTABLE STRICT;
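
A quick sanity check of either version:

SELECT isleapyear(2000), isleapyear(1900), isleapyear(2012);
-- true, false, true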

P.S. Some funny traditions:

  • On the British Isles, it is a tradition that women may propose marriage only in leap years.
  • In Denmark, the tradition is that women may propose on the bissextile leap day, February 24, and that refusal must be compensated with 12 pairs of gloves.
  • In Finland, the tradition is that if a man refuses a woman’s proposal on leap day, he should buy her the fabrics for a skirt.

Filed under: Coding Tagged: leap year, plpgsql, PostgreSQL, SQL, trick

John DeSoi: pgEdit 2.1 released


pgEdit has been updated to support PostgreSQL 9.1. The syntax highlighting grammar now supports all 9.1 keywords and the documentation tools recognize all SQL commands.

The variable PGEDIT_PSQL_PATH can be set in TextMate preferences to customize psql executable name and/or location.

See the Download page for the latest release of pgEdit.

Andrew Dunstan: Filtering failures

I have just made available a facility to filter what you see on the buildfarm failures page. Now you can choose the period it shows (up to the last 90 days, default is 10 days), and which branches, members and/or failure stages are shown.

Chris Travers: A Reply to Tony Marston's advocacy of unintelligent databases

Tony Marston has published an interesting critique of my posting about why intelligent database are helpful.  The response is thought-provoking and I suggest my readers read it.  Nonetheless I believe that the use cases where he is correct are becoming more narrow over time, and so I will explain my thoughts here.

A couple of preliminary points must be made though.  There is a case to be made for intelligent databases, for NoSQL, and for all sorts of approaches.  These are not as close to a magic bullet as some proponents would like to think, and there will be cases where each approach wins out, because design is the art of endlessly making tradeoffs.  My approach has nothing to do with what is proper and has to do instead with preserving the ability to use the data in other ways for other applications, as I see this as the core strength of relational database management systems.  Moreover I used to agree with Tony, but I have since changed my mind, primarily because I have begun working with data environments where the stack assumption for application design doesn't work well.  ERP is a great example and I will go into why below.

Of course in some cases data doesn't need to be reused, and an RDBMS may still be useful.  This is usually the case where ad-hoc reporting is a larger requirement than scalability and rapid development.  In other cases where data reuse is not an issue an RDBMS really brings nothing to the table and NoSQL solutions of various sorts may be better.

My own belief is that a large number of database systems out there operate according to the model of one database to many applications.  I also think that the more that this model is understood the more that even single-application-database designers can design with this in mind or at least ask if this is the direction they want to go.

And so with the above in mind, on to the responses to specific points:

Well, isn't that what it is supposed to be? The database has always been a dumb data store, with all the business logic held separately within the application. This is an old idea which is now reinforced by the single responsibility principle. It is the application code which is responsible for the application logic while the database is responsible for the storing and retrieval of data. Modern databases also have the ability to enforce data integrity, manage concurrency control, backup, recovery and replication while also controlling data access and maintaining database security. Just because it is possible for it to do other things as well is no reason to say that it should do other things.

The above opinion works quite well in a narrow use case, namely where one application and only one application uses the database.  In this case, the database tables can be modelled more or less directly after the application's object model and then the only thing the RDBMS brings to the table is some level of ad hoc reporting, at the cost of added development time and complexity compared to some NoSQL solutions.  (NoSQL solutions are inherently single application databases and therefore are not usable where the considerations below exist.)

The problem is that when you start moving from single applications into enterprise systems, the calculation becomes very different.  It is not uncommon to have several applications sharing a single database, and the databases often must be designed to make it quick and easy to add new applications on top of the same database. 

However as soon as you make this step something awful happens when you try to use the same approach used for single application databases:  because the database is based on the first application's object model, all subsequent applications must have intimate knowledge of the first application's object model, which is something of an anti-pattern.

A second major problem also appears at the same time: while a single application can probably be trusted to enforce meaningful data constraints, applications can't and shouldn't trust each other to input meaningful data.  The chance of oversights, where many applications are each responsible for checking all inputs for sanity and ensuring that only meaningful data is stored, is very high, and the consequences can be quite severe.  Therefore things like check constraints suddenly become irreplaceable when one makes the jump from one data entry application to two against the same database.
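
For instance, a declarative constraint like the following (hypothetical table and column) is enforced no matter which application performs the insert:

ALTER TABLE invoice ADD CONSTRAINT invoice_amount_positive CHECK (amount > 0);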


Just because a database has more features does not mean that you should bend over backwards to use them. The English language contains a huge number of words that I have never used, but does that make my communication with others unintelligible? I have used many programming languages, but I have rarely used every function that has ever been listed in the manual, but does that mean that my programs don't work? With a relational database it is possible to use views, stored procedures, triggers or other advanced functionality, but their use is not obligatory. If I find it easier to do something in my application code, then that is where I will do it.

Indeed.  However, if you aren't thinking in terms of what the RDBMS can do, but rather just in terms of it as a dumb storage layer, you will miss out on these features when they are of benefit.  Therefore it is important to be aware of the costs of this when asking "should my database be usable by more than one application?"

That question becomes surprisingly useful to ask when asking questions like "should data be able to be fed into the database by third party tools?"

Where did this idea come from? The application has to access the database at some point in time anyway, so why do people like you keep insisting that it should be done indirectly through as many intermediate layers as possible? I have been writing database applications for several decades, and the idea that when you want to read from or write to the database you should go indirectly through an intermediate component instead of directly with an SQL query just strikes me as sheer lunacy. This is a continuation of the old idea that programmers must be shielded completely from the workings of the database, mainly because SQL is not object oriented and therefore too complicated for their closed minds. If modern programmers who write for the web are supposed to know HTML, CSS and Javascript as well as their server-side language, then why is it unreasonable for a programmer who uses a database to know the standard language for getting data in and out of a database?

The idea comes from the necessity of ensuring that the database is useful from more than one application.  Once that happens, the issue becomes one of too much intimate knowledge of data structures shared between components.

However one thing I am not saying is that SQL-type interfaces are categorically out.  It would be a perfectly reasonable approach to encapsulation to build updatable views matching the object models of each application, and indeed reporting frameworks perhaps should be built this way to the extent possible, at least where multiple applications share a database.  The problem this solves is accommodating changes to the schema required by one application but not required by a second application.  One could even use an ORM at that point.
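
A minimal sketch of that kind of encapsulation on PostgreSQL 9.1 or later, which supports INSTEAD OF triggers on views, with a writable application-facing view (hypothetical names throughout):

CREATE VIEW app_customer AS
    SELECT id, name, email FROM customer;

CREATE FUNCTION app_customer_insert() RETURNS trigger AS $$
BEGIN
    INSERT INTO customer (name, email) VALUES (NEW.name, NEW.email);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER app_customer_insert_trg
    INSTEAD OF INSERT ON app_customer
    FOR EACH ROW EXECUTE PROCEDURE app_customer_insert();

Each application gets a view shaped like its own object model, while the underlying relations stay free to change.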

If your programmers have to spend large amounts of time in writing boilerplate code for basic read/write/update/delete operations for your database access then you are clearly not using the right tools. A proper RAD (Rapid Application Development) toolkit should take care of the basics for you, and should even ease the pain of dealing with changes to the database structure. Using an advanced toolkit it should possible to create working read/write/update/delete transactions for a database table without having to write a single line of SQL. I have done this with my RADICORE toolkit, so why can't you? Simple JOINs can be handled automatically as well, but more complex queries need to be specified within the table class itself. I have implemented different classes for different database servers, so customers of my software can choose between MySQL, PostgreSQL, Oracle and SQL Server.

How does this work though when you have multiple applications piping heavy read/write workloads through the same database, but where each application has unique data structure requirements and hence its own object model?

The very idea that your software data structure should be used to generate your database schema clearly shows that you consider your software structure, as produced from your implementation of Object Oriented Design (OOD) to be far superior to that of the database.....

It seems I was misunderstood.  I was saying that the generation of database structures from Rails is a bad thing.  I am in complete agreement with Tony on that part I think.

You may think that it slows down development, but I do not. You still have to analyse an application's data requirements before you can go through the design process, but instead of going through Object Oriented design as well as database design I prefer to ignore OOD completely and spend my time in getting the database right. Once this has been done I can use my RAD toolkit to generate all my software classes, so effectively I have killed two birds with one stone. It also means that I don't have to waste my time with mock objects while I'm waiting for the database to be built as it is already there, so I can access the real thing instead of an approximation. As for using an ORM, I never have the problem that they were designed to solve, so I have no need of their solution.

This is an interesting perspective.   I was comparing it though to initial stages of design and development in agile methodologies.

Here's the basic thing.  Agile developers like to start out when virtually nothing is known about requirements with a prototype designed to flesh out the requirements and refine that prototype until you get something that arguably works.

I am saying that, in the beginning at least, it is worth asking what the data being entered actually is and how to model it neutrally of business rules.

I would agree though that the time spent there on looking at the data, modelling issues, and normalization is usually recouped later, so I suppose we are in agreement.


This is another area where OO people, in my humble opinion, make a fundamental mistake. If you build your classes around the database structure and have one class per table, and you understand how a database works, you should realise that there are only four basic operations that can be performed on a database table - Create, Read, Update and Delete (which is where the CRUD acronym comes from). In my framework this means that every class has an insertRecord(), getData(), updateRecord() and deleteRecord() method by default, which in turn means that I do not have to waste my time inventing unique method names which are tied to a particular class. Where others have a createCustomer(), createProduct(), createOrder() and createWhatever() method I have the ubiquitous insertRecord() method which can be applied to any object within my application. This makes use of the OO concept of polymorphism, so it should be familiar to every OO programmer. Because of this I have a single page controller for each of the methods, or combination of methods, which can be performed on an object, and I can reuse the same page controller on any object within my application. I have yet to see the same level of reuse in other application frameworks, but isn't a high level of code reuse what OOP is supposed to be about?
I am often told "but that is not the way it is done!" which is another way of saying "it is not the way I was taught". This tells me that either the critic's education was deficient, or he has a closed mind which is not open to other techniques, other methods or other possibilities.
The problem occurs when CRUD operations are not really as simple as this sounds, and where they must have complex constraints enforced from multiple data entry applications.  A good example here is posting GL transactions, where each transaction must be balanced.  This is far simpler to do in a stored procedure than anything else, because the set has certain emergent constraints that apply beyond the scope of a single record.  Also, if the application interface is abstracted from the low-level storage, this may be necessary as part of mapping views to relations.
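
As an illustration only (hypothetical journal_entry and journal_line tables, not any particular product's schema), a posting function can refuse an unbalanced set of lines in one place, for every client application:

CREATE OR REPLACE FUNCTION post_journal_entry(p_entry_id integer)
RETURNS void AS $$
BEGIN
    -- a balanced entry's debits and credits sum to zero
    IF (SELECT coalesce(sum(amount), 0)
          FROM journal_line WHERE entry_id = p_entry_id) <> 0 THEN
        RAISE EXCEPTION 'journal entry % is not balanced', p_entry_id;
    END IF;
    UPDATE journal_entry SET posted = true WHERE id = p_entry_id;
END;
$$ LANGUAGE plpgsql;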


I can even make changes to a table's structure, such as adding or deleting a column, or changing a column's size or type, without having to perform major surgery on my table class. I simply change the database, import the new structure into my data dictionary, and then export the changes to my application. I don't have to change any code unless the structural changes affect any business rules. My ERP application started off with 20 database tables but this has grown to over 200 over the years, so I am speaking from direct experience.

So if you are doing an ERP application and at the same time putting all business logic in your application, you do not expect any third party applications to hit your  database, right?  So really you get the same basic considerations I am arguing for by making the database into the abstraction layer rather than encapsulating the db inside an abstraction layer.

If we think about it this way, it depends on what the platform is that you are developing on.  We like to have the database be that platform, and that means treating interfaces to it as an API.  Evidently you prefer to force all access to go through your application.  I think we get a number of benefits from this approach including language neutrality for third party components, better attention to performance on complex write operations, and more.  That there is a cost cannot be argued with however.

I would conclude by saying that we agree on a fair bit of principles of database design.  We agree normalization is important, and I would add that this leads to application-neutral data storage.  Where we disagree is where the API/third party tool boundary should be.  I would prefer to put in the database whatever seems to belong to the general task of data integrity, storage, retrieval, and structural presentation while leaving what we do with this data to the multiple third party applications which utilize the same database.

Joe Abbate: Business Logic in the Database


Chris Travers recently responded to Tony Marston’s critique of an earlier post where Chris advocated “intelligent databases”1. Chris’ response is well reasoned, particularly his point that once a database is accessed by more than a single application or via third-party tools, it’s almost a given that one should attempt to push “intelligence” and business logic into the database if possible.

However, there is a paragraph in Tony’s post that merits further scrutiny:

The database has always been a dumb data store, with all the business logic held separately within the application. This is an old idea which is now reinforced by the single responsibility principle. It is the application code which is responsible for the application logic while the database is responsible for the storing and retrieval of data. Modern databases also have the ability to enforce data integrity, manage concurrency control, backup, recovery and replication while also controlling data access and maintaining database security.

If a database (actually, a shortening of DBMS—a combination of the data and the software that manages it) has always been dumb, then presumably one would never specify UNIQUE indexes. It is a business requirement that invoice or employee numbers be unique, so if all the business logic should reside in the application, then the DBA should only create a regular index and the application —all the applications and tools!— should enforce uniqueness.

Tony’s mention of “data integrity” is somewhat ambiguous because different people have varied understandings of what that covers. As C. J. Date points out, “integrity … is the part [of the relational model] that has changed [or evolved] the most over the years.”2 Perhaps Tony believes that primary keys, and unique and referential constraints should be enforced by the DBMS, but apparently an integrity constraint such as “No supplier with status less than 20 supplies any part in a quantity greater than 500”3 should instead only be code in an application (make that all applications that access that database).
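
That particular constraint spans rows in two tables, so in PostgreSQL it needs something like a trigger rather than a column CHECK; a minimal sketch against hypothetical s and sp tables, after Date's suppliers-and-shipments example:

CREATE FUNCTION check_shipment_qty() RETURNS trigger AS $$
BEGIN
    IF NEW.qty > 500 AND EXISTS (
           SELECT 1 FROM s WHERE s.sno = NEW.sno AND s.status < 20
       ) THEN
        RAISE EXCEPTION 'supplier % has status below 20 and cannot supply quantity above 500', NEW.sno;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sp_qty_status_check
    BEFORE INSERT OR UPDATE ON sp
    FOR EACH ROW EXECUTE PROCEDURE check_shipment_qty();

(A full implementation would also need a corresponding check when a supplier's status is lowered, which is exactly the sort of cross-table rule that is easiest to keep in one place.)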

As for me, as I pointed out earlier, “constraints should be implemented, preferably declaratively, in the database schema” while “type constraints … should also be implemented in the user interface” (emphasis added). Ideally, the user interface should derive its code directly from the schema.


1 Interestingly, the first time I heard the term “intelligent database” it was from Dave Kellogg giving a marketing presentation for Ingres 6.3, which had incorporated some of the features in the UC Berkeley POSTGRES project.
2 Date, C. J. An Introduction to Database Systems, 8th Edition. 2004, p. 253.
3 Ibid., p. 254.


Filed under: PostgreSQL

Andrew Dunstan: Buildfarm status emails

Apparently the fact that the buildfarm sends out email status notifications is one of the community's best kept secrets. It was one of the earliest features. There are four mailing lists for status that can be subscribed to, according to your taste. Please visit the lists page on pgFoundry if you want to find out when things break (or get fixed).

Tomas Vondra: Fulltext with dictionaries in shared memory


If you're using the fulltext search built into PostgreSQL and you can't use a solution based on snowball due to the nature of the language (it works great for English, but there's nothing similar for Czech with reasonable accuracy and probably never will be), you're more or less forced to use ispell-based dictionaries. In that case you've probably noticed two annoying features.

For each connection (backend), the dictionaries are loaded into private memory, i.e. each connection spends a lot of CPU parsing the dictionary files on the first query, and it needs a lot of memory to store the same information. If the parsed dictionary needs 25MB (and that's not an exception) and if you do have 20 concurrent connections using the fulltext, you've suddenly lost 500 MB of RAM. That may seem like a negligible amount of RAM nowadays, but there are environments where this is significant (e.g. VPS or some cloud instances).

There are workarounds - e.g. using a connection pool with already initialized connections (you skip the initialization time but still waste the memory) or keeping a small number of persistent connections just for fulltext queries (but that's inconvenient to work with). There are probably other solutions, but none of them is perfect ...

Recent issues with the fulltext forced me to write an extension that allows storing the dictionaries in shared memory. Even this solution is not perfect (more on that later), but it's definitely a step in the right direction. So how does it work, and what does it do?
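
For context, an ispell dictionary is declared with the standard CREATE TEXT SEARCH DICTIONARY command (Czech file names shown as a hypothetical example, assuming czech.dict, czech.affix and czech.stop are installed under tsearch_data); it is the per-backend loading of these files that the extension moves into shared memory:

CREATE TEXT SEARCH DICTIONARY czech_ispell (
    TEMPLATE  = ispell,
    DictFile  = czech,
    AffFile   = czech,
    StopWords = czech
);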

Andrew Dunstan: Blue sky

Around this time of year I generally take a bit of time to think what I want to work on during the coming year, and what I want to write talks on for conferences. I have a couple of things I need to pay some attention to:
  • PLV8
  • Cleaning up my text file Foreign Data Wrappers
Beyond that, some of the things I want to see done and I have some interest in working on include, in no particular order:
  • LATERAL subqueries
  • Window functions for PL/Perl
  • GROUPING SETS including CUBE and ROLLUP
  • making a better API for and rewriting xpath_table()
  • retail CREATE statement generation for any object
Of course, I can't work on all of these or even most of them. But with any luck I'll work on one or two (I've begun in one case) and with a little more luck I'll find subjects there for a talk or two.

Oh, and I really wish someone (not me) would work on MERGE, or at least the REPLACE/UPSERT piece of it, but it seems to be something of a graveyard for development efforts. It's by far the most important missing feature, in my far from humble opinion.

Hubert 'depesz' Lubaczewski: OmniPITR 0.3.0

Just released version 0.3.0 of our tool for handling WAL based replication in PostgreSQL – OmniPITR. Version jump is related to addition of another tool – omnipitr-synch. This tool is used to copy PostgreSQL data dir (including all tablespaces of course) to remote location(s). While this process is usually simple (call pg_start_backup(), transfer data, call [...]

Keith: PG Extractor - A smarter pg_dump


For my debut blog post, I'll be explaining a tool, pg_extractor, that I began working on at my current job. I'd like to first give a big thank you to depesz and the DBA team at OmniTI for their help in making this tool.

We already had a tool in use for doing a one-file-per-object schema dump of PostgreSQL databases for use in sending to version control. It was originally designed with some complex queries to fetch the internal table, view & function schema, but with the release of 9.x, many of those queries broke with internal structure changes. It also didn't account for partitioned tables with inheritance, overloaded functions and several other more advanced internal structures. I'd originally set out to just fix the schema queries to account for these changes and shortcomings, but depesz mentioned his idea for a complete reworking of the tool using pg_dump/pg_restore instead of complex queries. This would hopefully keep it as future proof as possible and keep the output in a format that was better guaranteed to be a representation of how PostgreSQL sees its own object structure, no matter how it may change in the future.

SVN options have been integrated into the tool and I have plans on getting Git in as well. While the ideal way to track a database's schema in VCS would be before you actually commit it to your production system, this was an easy way to just have the schema available for reference and automatically track changes outside of the normal development cycle.

Working on this tool has shown many shortcomings with the built-in pg_dump/restore. One of my hopes with this project is to show what improvements can be made and possibly built in sometime in the future. One of the biggest shortcomings is that ACLs and comments are not output for anything but tables. If you have many individual objects you'd like to export, formatting that for pg_dump can be quite tedious, and there is no filtering for anything other than tables and schemas with pg_dump. The filter options list for pg_extractor has become quite long and will probably only continue to grow. Just do a --help or see the documentation on GitHub. I think one of the nicest features is the ability to use external files as input to the filter options and also filter with regex!
Probably best to move on to some examples instead of continuing this wall of text...

perl pg_extractor.pl -U postgres --dbname=mydb --getall --sqldump

This is a very basic use of the tool using many assumed defaults. It uses the postgres database user, grabs all objects (tables, views, functions/aggregates, custom types, & roles), and also makes a permanent copy of the pg_dump file used to create the individual schema files. I tried to use options that were similar to, or exactly the same as, the options for pg_dump/restore. There's only so many letters in the alphabet though, so please check the documentation to be sure. Just like pg_dump, it will also use the PG environment variables $PGDATABASE, $PGPORT, $PGUSER, $PGHOST, $PGPASSFILE, $PGCLIENTENCODING if they are set (they are actually used internally by the script as well if the associated options are used).

perl pg_extractor.pl -U postgres --dbname=mydb --getfuncs --n=keith

This will extract only the functions from the "keith" schema. Any overloaded functions are put into the same file. I'd thought about trying to use the parameters to somehow make unique names for each version of an overloaded function, but I found this much easier for now, especially when going back and removing files if the --delete cleanup option is set.

perl pg_extractor.pl -U postgres --dbname=mydb --getfuncs -p_file=/home/postgres/func_incl --n=dblink

This will extract only the specifically named functions in the given filename. You must ensure that the full function signature is given with only the variable types for arguments. When using include files, it's best to explicitly name the schemas that the objects in the file are in as well (it makes the temporary dump file that's created smaller). The contents of any external file list are newline separated, like so:

dblink_exec(text, text)
dblink_exec(text, text, boolean)
dblink_exec(text)
dblink_exec(text, boolean)

One of the things I ensured was possible for external file lists is that you can use an object list created with psql using the \t and \o options:

postgres=# \t
Showing only tuples.
postgres=# \o table_filter
postgres=# select schemaname || '.' || tablename from pg_tables;

You could then use the "table_filter" file with the --t_file or --T_file options. I use this ability at work for a cronjob that generates a specific list of views I want exported and tracked in SVN. But that view list is not static and can change at any time. So the cronjob generates a new view filter file every day.

perl pg_extractor.pl -U postgres --dbname=mydb --gettables -Fc --t_file=/home/postgres/tbl_incl --getdata

This extracts only the tables listed in the given filename along with the data in the pg_dump "custom" format. By default, pg_extractor's output will all be in "plain" (-Fp) format since the intended purpose of the tool is to create output in a human-readable format.

perl pg_extractor.pl -U postgres --dbname=mydb --pgdump=/opt/pgsql/bin/pg_dump --pgrestore=/opt/pgsql/bin/pg_restore --pgdumpall=/opt/pgsql/bin/pg_dumpall --getall --regex_excl_file=/home/postgres/part_exclude --N=schema1,schema2,schema3,schema4

This excludes objects (most likely table partitions) that have the pattern objectname_pYYYY_MM or fairly similar. Binaries are also not in the $PATH and it EXCLUDES several schemas.
part_exclude file contains: _p_?(20|19)\d\d(_?\d+)*$

perl pg_extractor.pl -U postgres --dbname=mydb --svn --svncmd=/opt/svn/bin/svn --commitmsg="Weekly svn commit of postgres schema" --svndel

This is an example using svn. To keep the svn password out of system logs, the svn username & password have to be manually entered into the script file. It's the only option that requires any manual editing of the source code. May see about having this as an option that points to an external file instead. The --svndel option cleans up any objects that are removed from the database from the svn repository.
Update: Writing up this blog post got me motivated to fix having to edit any source to get a certain option to work. There is now an --svn_userfile option where you can give it the path to a file containing the svn username and password to use. See the help for formatting. Just set permissions appropriately on this file for security and ensure the user running pg_extractor can read it.

One of the first fun issues I came across was handling objects that have special characters in their names. It's rare, but since individual objects are all in their own files based on their object names, the OS doesn't particularly like that, and caused some rather annoying errors every time the script ran. This is now handled by hex-encoding the special character and preceding that character with a comma in the file name.
For example: table|name becomes table,7cname.sql
This hopefully makes it easy to decode if needed by any other tools. It's actually done internally in the tool when you want objects that were deleted to also have their files deleted on subsequent runs.

One issue that I still haven't solved (and make note of in the source where it would possibly be handled) is that in the output of pg_restore -l, the signature shown for a function differs from the one shown for its associated comment whenever the parameters are given names.

keith@pgsql:~$ pg_restore -l pgdump.pgr 
...
14663; 1255 16507 FUNCTION keith do_something(integer, text) postgres
48818; 0 0 COMMENT keith FUNCTION do_something(data_source_id integer, query text) postgres
...

Since looping back over the pg_restore -l list is how I find overloaded functions, ACLs and comments, it means it's very hard to match those comments with those functions in all cases since PostgreSQL has variable types with spaces in them. If anyone has any solutions for this, I'd appreciate feedback. But this honestly seems like a bug to me since only the variable type in the parameter list is actually needed to uniquely identify a function. If it's ok for the function definition itself, it should be ok for the associated comments as well.

As I said in the beginning, I will hopefully get Git integration done as well. Another option I'm working on right now is a filter for objects where a role has permissions on it, not just ownership (which there is an option for!). For example, dump all objects that the role "keith" has permission on. Maybe even more specific permissions.

I hope others will find this as useful as I have. It's one of my first large projects released into the public, so I'd appreciate any constructive feedback on code quality and bugs found. I've enjoyed working on it and learned a great deal in the process.

https://github.com/omniti-labs/pg_extractor


Denish Patel: What is pg_extractor ?

In my recent blog post, I wrote about the PostgreSQL DBA Handyman toolset. One of the tools in that list is getddl. If you are using getddl to get the DDL schema and track daily changes in SVN for production databases, you should consider moving that process to pg_extractor instead. pg_extractor is a more advanced and robust tool for extracting schema as well as data using pg_dump. Keith Fiske, the author of the tool, described it in detail in his blog post. Thanks to Keith for making the schema extraction tool more robust and taking it to the next level!

Hopefully it will help you have more control over your database in a smarter way!

David Keeney: Stub: Multiple Rows

Multiple records returned by a function.



Example:

convert integers to words
'41' => 'forty one'
'17' => 'seventeen'
'501' => 'five hundred one'



illustrate converting one integer, and then returning a whole series.

Also shows use of array variables



CREATE OR REPLACE FUNCTION namenumbers( n INTEGER )
RETURNS VARCHAR
AS $$
DECLARE
ary INTEGER[];
BEGIN
ary[0] := 1;
RETURN 'one';
END;
$$ LANGUAGE plpgsql;


SELECT namenumbers(%s);
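
Since the stub above only returns a single value, here is a minimal sketch (hypothetical function name, not part of the original stub) of the set-returning version the notes point toward, using RETURN NEXT and an array variable:

CREATE OR REPLACE FUNCTION namenumbers_series(upto INTEGER)
RETURNS SETOF VARCHAR
AS $$
DECLARE
  -- words for the first few integers; a real implementation would compute these
  words VARCHAR[] := ARRAY['one', 'two', 'three', 'four', 'five'];
  i INTEGER;
BEGIN
  FOR i IN 1 .. least(upto, array_length(words, 1)) LOOP
    RETURN NEXT words[i];
  END LOOP;
  RETURN;
END;
$$ LANGUAGE plpgsql;

SELECT * FROM namenumbers_series(3);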

Andrew Dunstan: How not to use a bug tracker

From time to time suggestions are made that the PostgreSQL project should use trackers to manage bugs and possibly feature requests. I have a lot of sympathy with these suggestions. But there has always been lots of pushback, along with significant disagreement about which tracker to use. Having done a bunch of work years ago to make Bugzilla platform independent, I have some fondness for it, but others hate it with a passion that seems way out of proportion to the perceived evil, so it's probably out of the question, if we ever did decide to use some sort of tracker.

Meanwhile, I encountered the Perl community's use of a tracker yesterday. David Wheeler encouraged me to file a bug about the recently discovered misbehaviour of a documented piece of the perl API. Accordingly, I ran perlbug and it told me that it had sent the bug report. Later he asked me if I had done so, as he couldn't find the report. First black mark. Then I went and looked at the tracker's web interface. What I saw was just horrible. For perl 5 there are 259 "new" bugs, (the oldest 8 years old, which isn't "new" in my book) and 1149 open bugs. And my bug, which their own program told me had been successfully submitted, didn't seem to be there. What's the point in having such a system? It's worse than useless. Trackers require effort to maintain. As I remarked to David, it's no wonder that there is significant resistance to using them when we have horrible examples like this one.

Mark Wong: PDXPUG: January Meeting


When: 7-9pm Thu January 19, 2012
Where: Iovation
Who: Tim Bruce
What: Database Trending

Tim started a new job earlier this year and was looking at how to gather some performance metrics to measure database performance over long periods. While these will change since he works in a very fluid environment, being able to see some differences over time will allow for some forecasting as well. Tim will start by showing some of the things he’s identified, and the code he has, to track these elements and then will lead a round-robin discussion of what other people may be doing to capture their own metrics.

Tim Bruce has been working with computers for over 25 years doing Data Management, Programming, Systems Administration and Database Administration. Currently, Tim is the Database Administrator for Conducive Technology, the company behind FlightStats.

Our meeting will be held at Iovation, on the 32nd floor of the US Bancorp Tower at 111 SW 5th (5th & Oak). It’s right on the Green & Yellow Max lines. Underground bike parking is available in the parking garage; outdoors all around the block in the usual spots. No bikes in the office, sorry!

Building security will close access to the floor at 7:30.

After-meeting beer location TBD. See you there!


Gabrielle Roth: Integrating Two Pg Databases

A while back, I needed to store two dramatically different data sets: our equipment inventory and our user login history. So I set up a separate database for each. “It seemed like a good idea at the time.” Business rules changed (as they do) and I now have a need to integrate data from both [...]

Andrew Dunstan: When things are different they are not the same

A customer asked me how to avoid statements timing out. Knowing that they use connection pooling, my answer was:
begin;
set local statement_timeout = 0;
select long_running_function();
commit;
But they reported that it didn't work. I got it working on exactly the same machine, database and account they were using. Before reading further, see if you can guess why it worked for me and not for them. Imagine the Jeopardy music playing while you think. You have 30 seconds.

...

OK, I'll tell you. Their client program submitted this whole block as one statement, presumably using libpq's PQexec() function. So the statement timeout in effect at the time the statement was submitted applied to the whole collection. My client breaks up multiple statements and sends them to the backend one at a time. So the code worked as expected for me. When I understood what was happening, I got them to do:
set session statement_timeout = 0;
and send that to the server before sending their long running query, and suddenly everything worked as expected. This was safe to do because, although their main application uses connection pooling, the context in which they were running does not. (The pooler doesn't issue RESET commands.)
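
For the record, a pooler that did reset state between checkouts would send something like one of these before handing the connection to the next client:

RESET statement_timeout;
DISCARD ALL;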

I hope you've guessed by now that the clients in question were psql for me and pgadmin for my customer. I find pgadmin a good tool for exploring a large database schema, but I much prefer to use psql as a tool for running queries, for just these sorts of reasons.