
Guillaume (ioguix) de Rorthais: Using pgBadger and logsaw for scheduled reports


Hey,

While waiting for the next version of pgBadger, here is a tip to create scheduled pgBadger reports. For this demo, I'll suppose:

  • we have PostgreSQL's log files in "/var/log/pgsql"
  • we want to produce a weekly report using pgBadger with the "postgres" system user...
  • ...so we keep at least 8 days of logs

You will need pgbadger and logsaw. Both tools are under the BSD license and are pure Perl scripts with no dependencies.

"logsaw" is a tool aimed to parse log files, looking for some regexp-matching lines, printing them to standard output and remembering where it stop last time. At start, it searches for the last line it parsed on the previous call, and starts working from there. Yes, for those familiar with "tail_n_mail", it does the same thing, but without the mail and report processing part. Moreover (not sure about "tail_n_mail"), it supports rotation and compression of log files. Thanks to this tool, we'll be able to create new reports from where the last one finished !

We need to create a simple configuration file for logsaw:

$ cat <<EOF > ~postgres/.logsaw
LOGDIR=/var/log/pgsql/
EOF

There are a few more optional parameters in this configuration file you might want to know about (a sample configuration using them follows this list):

  • if you only want to process some particular files, you can use the "LOGFILES" parameter, a regular expression used to filter files in the "LOGDIR" folder. When not defined, the default empty string means: «take all files in the folder».
  • if your log files are compressed, you can use the "PAGER" parameter, which should be a command that uncompresses your log files to standard output, e.g. "PAGER=zcat -f". Note that, if available, "logsaw" will silently use "IO::Zlib" to read your compressed files.
  • you can filter the extracted lines using the "REGEX" parameters. You can add as many "REGEX" parameters as needed, each of them a regular expression. When not defined, the default empty "REGEX" array means: «match all the lines».
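
For example, a configuration using all three optional parameters might look like this (a hypothetical sketch; adapt the file pattern and the regular expressions to your own setup):

$ cat <<EOF > ~postgres/.logsaw
LOGDIR=/var/log/pgsql/
LOGFILES=^postgresql-.*\.log
PAGER=zcat -f
REGEX=LOG:
REGEX=ERROR:
EOF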

Each time you call "logsaw", it saves its state to your configuration file, adding the "FILEID" and "OFFSET" parameters.

That's it for "logsaw". See its README files for more details, options and sample configuration file.

Now, the command line to create the report:

$ logsaw | pgbadger -g -o /path/to/report-$(date +%Y%m%d).html -

You might want to add the "-f" option to pgbadger if it can't guess the log format (stderr, syslog or csv) by itself.
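
For instance, forcing the stderr format on the same pipeline would give (paths as in the example above):

$ logsaw | pgbadger -f stderr -g -o /path/to/report-$(date +%Y%m%d).html -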

To create reports on a weekly basis, let's say every Sunday at 2:10am, use crontab:

  10 2 * * 0  logsaw | pgbadger -g -o /path/to/report-$(date +\%Y\%m\%d).html -

(Note that "0" in the day-of-week field means Sunday, and that "%" is a special character in crontab entries, hence the "\%" escaping.)

Here you go, enjoy :)

That's why I like commands in the UNIX style and spirit: simple, single-task, but powerful and complementary tools.

Wait for (or help with) more nice features in pgBadger!

Cheers !

PS: There's another tool to deal with log files and reports you might be interested in: "logwatch"


Selena Deckelmann: Los Angeles Meetup Group formed!

Yesterday on IRC, a Postgres user — goodwill in #postgresql on Freenode — piped up and said he'd really like to see a Los Angeles, CA Meetup. We have a mailing list, but it's gone a bit quiet in … Continue reading

Josh Berkus: PostgresXC Live Streaming at SFPUG Aug. 7

Once again, we will have SFPUG Live streaming.  This month's presentation is Mason Sharp presenting PostgresXC -- the Clustered, Write-Scalable Postgres.  Video will be on Justin.TV; I will try to make HD video work this time.  We'll see!  

Streaming video should start around 7:15PM PDT, +/- 10 minutes.

Guillaume (ioguix) de Rorthais: Normalizing queries with pg_stat_statements < 9.2


Hey,

If you follow PostgreSQL's development or Depesz' blog, you might know that the "pg_stat_statements" extension is getting a lot of improvements in 9.2, and especially that it is now able to «lump "similar" queries together». I will not re-phrase here what Depesz already explained on his blog.

So, we have this great feature in 9.2, but what about previous releases? Up to 9.1, "pg_stat_statements" keeps track of the most frequent queries individually. No normalization, nothing. I had been thinking for a while about porting the pgFouine/pgBadger normalization code to SQL. The following pieces of code were tested under PostgreSQL 9.1, but should be easy to port to previous versions. So here is the function to create (I tried my best to keep it readable :-)):

CREATE OR REPLACE FUNCTION normalize_query(IN TEXT, OUT TEXT) AS $body$
  SELECT
    regexp_replace(regexp_replace(regexp_replace(regexp_replace(
    regexp_replace(regexp_replace(regexp_replace(regexp_replace(

    lower($1),
    
    -- Replace runs of spaces, newlines and tab characters with a single space
    '[\t\s\r\n]+',                  ' ',           'g'   ),

    -- Remove string content                       
    $$\\'$$,                        '',            'g'   ),
    $$'[^']*'$$,                    $$''$$,        'g'   ),
    $$''('')+$$,                    $$''$$,        'g'   ),

    -- Remove NULL parameters                      
    '=\s*NULL',                     '=0',          'g'   ),

    -- Remove numbers                              
    '([^a-z_$-])-?([0-9]+)',        '\1'||'0',     'g'   ),

    -- Remove hexadecimal numbers                  
    '([^a-z_$-])0x[0-9a-f]{1,10}',  '\1'||'0x',    'g'   ),

    -- Remove IN values                            
    'in\s*\([''0x,\s]*\)',          'in (...)',    'g'   )
  ;
$body$
LANGUAGE SQL;
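
As a quick illustration (an example of mine, not from the original post), literals are collapsed so that similar queries normalize to the same text:

SELECT normalize_query($$SELECT * FROM foo WHERE id = 42 AND name = 'bob'$$);
-- returns: select * from foo where id = 0 and name = ''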

Keep in mind that I extracted these regular expressions straight from pgfouine/pgbadger. Any comments about how to make it quicker/better/simpler/whatever are appreciated!

Here is the associated view to group everything according to the normalized queries:

CREATE OR REPLACE VIEW pg_stat_statements_normalized AS
SELECT userid, dbid, normalize_query(query) AS query, sum(calls) AS calls,
  sum(total_time) AS total_time, sum(rows) AS rows,
  sum(shared_blks_hit) AS shared_blks_hit,
  sum(shared_blks_read) AS shared_blks_read,
  sum(shared_blks_written) AS shared_blks_written,
  sum(local_blks_hit) AS local_blks_hit,
  sum(local_blks_read) AS local_blks_read,
  sum(local_blks_written) AS local_blks_written, 
  sum(temp_blks_read) AS temp_blks_read,
  sum(temp_blks_written) AS temp_blks_written
FROM pg_stat_statements
GROUP BY 1,2,3;

Using this function and the view, a small pgbench test (-t 30 -c 10) gives:

SELECT round(total_time::numeric/calls, 2) AS avg_time, calls, 
  round(total_time::numeric, 2) AS total_time, rows, query 
FROM pg_stat_statements_normalized 
ORDER BY 1 DESC, 2 DESC;
 avg_time | calls | total_time | rows |                                               query                                               
----------+-------+------------+------+---------------------------------------------------------------------------------------------------
     0.05 |   187 |       9.86 |  187 | update pgbench_accounts set abalance = abalance + 0 where aid = 0;
     0.01 |   195 |       2.30 |  195 | update pgbench_branches set bbalance = bbalance + 0 where bid = 0;
     0.00 |   300 |       0.00 |    0 | begin;
     0.00 |   300 |       0.00 |    0 | end;
     0.00 |   196 |       0.00 |  196 | insert into pgbench_history (tid, bid, aid, delta, mtime) values (0, 0, 0, 0, current_timestamp);
     0.00 |   193 |       0.00 |  193 | select abalance from pgbench_accounts where aid = 0;
     0.00 |   183 |       0.26 |  183 | update pgbench_tellers set tbalance = tbalance + 0 where tid = 0;
     0.00 |     1 |       0.00 |    0 | truncate pgbench_history


For comparison, the real non-normalized "pg_stat_statements" view holds 959 rows:

SELECT count(*) FROM pg_stat_statements;
 count 
-------
   959
(1 row)

Obviously, regular expressions are not magic, and this will never be as strict as what the engine itself does. But at least it helps while waiting for 9.2 in production!

Do not hesitate to report bugs to me and to comment so I can improve it!

Cheers,

Bruce Momjian: Monitoring Postgres from the Command Line


You might already be aware that Postgres updates the process title of all its running processes. For example, this is a Debian Linux ps display for an idle Postgres server:

postgres  2544  2543  0 10:47 ?        00:00:00 /u/pgsql/bin/postmaster -i
postgres  2546  2544  0 10:47 ?        00:00:00 postgres: checkpointer process
postgres  2547  2544  0 10:47 ?        00:00:00 postgres: writer process
postgres  2548  2544  0 10:47 ?        00:00:00 postgres: wal writer process
postgres  2558  2544  0 10:47 ?        00:00:01 postgres: autovacuum launcher process
postgres  2575  2544  0 10:47 ?        00:00:02 postgres: stats collector process
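
If you want to watch these process titles refresh as activity happens, something like the following works on Linux (an illustrative command of mine, not from the article; "watch" and the "ps" options may differ on other platforms):

$ watch -n 2 'ps -u postgres -o pid,args'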

Continue Reading »

David Fetter: PostgreSQL Archeology

Selena Deckelmann: Postgres Open 2012 schedule announced!

TweetWe’re pleased to announce the Postgres Open 2012 schedule! A very special thanks to EnterpriseDB and Herkou for their Partner sponsorships. Please get in touch if you’d like to sponsor the conference this year! Please see a list of our … Continue reading

Bruce Momjian: Centralizing Connection Parameters


Hard-coding database connection parameters in application code has many downsides:

  • changes require application modifications
  • changes are hard to deploy and customize
  • central connection parameter management is difficult

Libpq does support setting connection parameters via environment variables, and this often avoids many of the downsides of hard-coding database connection parameters. (I already covered the importance of libpq as the common Postgres connection library used by all client interfaces except JDBC.)

However, there is another libpq feature that makes connection parameter sharing even easier: pg_service.conf. This file allows you to name a group of connection parameters and reference those parameters by specifying just the name when connecting. By placing this file on a network storage device, you can easily control application connections centrally. Change the file, and every new database connection sees the changes. While you can store passwords in pg_service.conf, everyone who can access the file can see those passwords, so you would probably be better off using libpq's password file.
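
As a minimal sketch (the service name, host and database below are invented for illustration), a service definition and its use might look like this:

# ~/.pg_service.conf (or a shared file pointed to by PGSERVICEFILE / PGSYSCONFDIR)
[accounting]
host=db.example.com
port=5432
dbname=ledger
user=app_user

$ psql "service=accounting"
$ PGSERVICE=accounting ./my_libpq_app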


Leo Hsu and Regina Obe: PL/V8JS and PL/Coffee Part 2: JSON search requests


PostgreSQL 9.2 beta3 was released this week, so we inch ever closer to final in another 2 months or so. One of the great new features is the built-in JSON type and the companion PL/V8 and PL/Coffee languages that allow for easy processing of JSON objects. One of the use cases we had in mind is to take a JSON search request as input and return a JSON dataset in turn.

We'll use our table from PLV8 and PLCoffee Upserting. Keep in mind that in practice the JSON search request would be generated by a client-side JavaScript API such as our favorite jQuery, but for quick prototyping we'll generate the request in the database with some SQL.

If you are on Windows and don't have plv8 available, we have PostgreSQL 9.2 64-bit and 32-bit plv8/plcoffee experimental binaries and instructions. We haven't recompiled against 9.2beta3, but our existing binaries seem to work fine on our beta3 install.
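
To give a flavour of the pattern (this is a simplified sketch of mine, not the function from the article; the table and column names are invented), a PL/V8 search function can parse the request and feed the values to a parameterized query:

CREATE OR REPLACE FUNCTION json_search(request text) RETURNS text AS $$
    var req  = JSON.parse(request);                     // e.g. {"prod_code": "widget"}
    var rows = plv8.execute(
        'SELECT * FROM products WHERE prod_code = $1',  // hypothetical table
        [req.prod_code]);
    return JSON.stringify(rows);                        // JSON array of matching rows
$$ LANGUAGE plv8;

SELECT json_search('{"prod_code": "widget"}');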


Continue reading "PL/V8JS and PL/Coffee Part 2: JSON search requests"

Egor Spivac: How to get some information about PostgreSQL structure (Part 2)


Schemas

How to get a list of schemas:
SELECT
  CASE
    WHEN nspname LIKE E'pg\_temp\_%' THEN 1
    WHEN (nspname LIKE E'pg\_%') THEN 0
    ELSE 3
  END AS nsptyp,
  nsp.nspname, nsp.oid,
  pg_get_userbyid(nspowner) AS namespaceowner,
  nspacl, description,
  has_schema_privilege(nsp.oid, 'CREATE') AS cancreate
FROM pg_namespace nsp
LEFT OUTER JOIN pg_description des ON des.objoid = nsp.oid
WHERE NOT (
      (nspname = 'pg_catalog' AND EXISTS
        (SELECT 1 FROM pg_class
          WHERE relname = 'pg_class' AND relnamespace = nsp.oid LIMIT 1))
   OR (nspname = 'information_schema' AND EXISTS
        (SELECT 1 FROM pg_class
          WHERE relname = 'tables' AND relnamespace = nsp.oid LIMIT 1))
   OR (nspname LIKE '_%' AND EXISTS
        (SELECT 1 FROM pg_proc
          WHERE proname = 'slonyversion' AND pronamespace = nsp.oid LIMIT 1))
   OR (nspname = 'dbo' AND EXISTS
        (SELECT 1 FROM pg_class
          WHERE relname = 'systables' AND relnamespace = nsp.oid LIMIT 1))
   OR (nspname = 'sys' AND EXISTS
        (SELECT 1 FROM pg_class
          WHERE relname = 'all_tables' AND relnamespace = nsp.oid LIMIT 1))
  )
  AND nspname NOT LIKE E'pg\_temp\_%'
  AND nspname NOT LIKE E'pg\_toast_temp\_%'
ORDER BY 1, nspname

Tables

Get all tables in the "public" schema:
SELECT n.nspname as "Schema",  c.relname AS datname,  
CASE c.relkind
WHEN 'r' THEN 'table'
WHEN 'v' THEN 'view'
WHEN 'i' THEN 'index'
WHEN 'S' THEN 'sequence'
WHEN 's' THEN 'special'
END as "Type", u.usename as "Owner",
(SELECT obj_description(c.oid, 'pg_class')) AS comment
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_user u ON u.usesysid = c.relowner
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname='public' AND c.relkind IN ('r','')
AND n.nspname NOT IN ('pg_catalog', 'pg_toast', 'information_schema')
ORDER BY datname ASC 

Fields

Get all fields of table "table1", with additional information (type, default value, not null flag, length, comment, foreign key name, primary key name):
SELECT pg_tables.tablename, pg_attribute.attname AS field,
  format_type(pg_attribute.atttypid, -1) AS "type",
  pg_attribute.atttypmod AS len,
  (SELECT col_description(pg_attribute.attrelid, pg_attribute.attnum)) AS comment,
  CASE pg_attribute.attnotnull
    WHEN false THEN 1 ELSE 0
  END AS "notnull",
  pg_constraint.conname AS "key",
  pc2.conname AS ckey,
  (SELECT pg_attrdef.adsrc FROM pg_attrdef
    WHERE pg_attrdef.adrelid = pg_class.oid
      AND pg_attrdef.adnum = pg_attribute.attnum) AS def
FROM pg_tables, pg_class
JOIN pg_attribute ON pg_class.oid = pg_attribute.attrelid
  AND pg_attribute.attnum > 0
LEFT JOIN pg_constraint ON pg_constraint.contype = 'p'::"char"
  AND pg_constraint.conrelid = pg_class.oid
  AND (pg_attribute.attnum = ANY (pg_constraint.conkey))
LEFT JOIN pg_constraint AS pc2 ON pc2.contype = 'f'::"char"
  AND pc2.conrelid = pg_class.oid
  AND (pg_attribute.attnum = ANY (pc2.conkey))
WHERE pg_class.relname = pg_tables.tablename
  AND pg_tables.tableowner = "current_user"()
  AND pg_attribute.atttypid <> 0::oid
  AND tablename = 'table1'
ORDER BY field ASC
 
 
See First part

Egor Spivac: Unique Index vs Unique Constraint

Do you know the difference between unique constraints and unique indexes?
In general there is no difference, but you can't use a unique index as the target of a foreign key.

How to use the unique constraint:


-- Table 2
CREATE TABLE test2
(
  test2_unique integer,
  CONSTRAINT test2_unique UNIQUE (test2_unique)
)
WITH (OIDS=FALSE);
ALTER TABLE test2 OWNER TO postgres;

-- Table 1
CREATE TABLE test1
(
  test1_unique integer NOT NULL DEFAULT 1,
  pk serial NOT NULL,
  CONSTRAINT pk_key PRIMARY KEY (pk),
  CONSTRAINT test1_test2_unique FOREIGN KEY (test1_unique)
    REFERENCES test2 (test2_unique) MATCH SIMPLE
    ON UPDATE CASCADE ON DELETE CASCADE
)
WITH (OIDS=FALSE);
ALTER TABLE test1 OWNER TO postgres;
 

Also, for better performance, you should create an index on the referencing field test1_unique:

CREATE INDEX fki_test1_test2_unique
ON test1 USING btree (test1_unique);

Egor Spivac: How to get some information about PostgreSQL structure (Part 1)


Types

Get Types list:
SELECT oid, format_type(oid, NULL) AS typname FROM pg_type WHERE typtype='b'

Users

Get users list:
SELECT rolname FROM pg_roles WHERE rolcanlogin ORDER BY 1


Databases

Get the list of tables in a given schema:
SELECT n.nspname as "Schema", c.relname as datname, CASE c.relkind 
WHEN 'r' THEN 'table'  
WHEN 'v' THEN 'view' 
WHEN 'i' THEN 'index' 
WHEN 'S' THEN 'sequence'  
WHEN 's' THEN 'special' 
END as "Type", u.usename as "Owner" 
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_user u ON u.usesysid = c.relowner
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname='YourSchemaName' AND c.relkind IN ('r','')  
AND n.nspname NOT IN ('pg_catalog', 'pg_toast', 'information_schema') 
 
 
 
See next part

Simon Riggs: PostgreSQL: The Multi-Model Database Server

I'd like to change the way we describe PostgreSQL.

Calling PostgreSQL an Object Relational database is misleading and years out of date. Yes, PostgreSQL is Relational and the project follows the SQL Standard very closely, but that's not all it does.

PostgreSQL supports all of the following:

* Relational
* Object Relational
* Nested Relational (record types)
* Array Store
* Key-Value Store (hstore)
* Document Store (XML, JSON)

and 9.2 adds

* Range Types

So what do we call it?

We support multiple models, so I guess we should call it a "Multi-Model Database".

The good thing here is that we support them all, in one platform, allowing you to join data together no matter what shape the data is held in.
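
For instance, a single table can mix several of these models side by side (a hypothetical illustration of mine; hstore must be installed as an extension, and the json and range types require 9.2):

CREATE EXTENSION hstore;

CREATE TABLE events
(
  id        serial PRIMARY KEY,
  title     text,        -- plain relational column
  tags      hstore,      -- key-value store
  payload   json,        -- document store (new in 9.2)
  attendees text[],      -- array store
  timespan  tsrange      -- range type (new in 9.2)
);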

Which means that PostgreSQL is a great General Purpose database and a great default choice for use by applications. Stonebraker has spoken out against the idea of a General Purpose database, but his interest is in bringing VC-funded startups to market, not in supporting production systems and catering to a range of business requirements with flexibility and speed. The reality is that if you pick a specialised database that fits your current requirements you're completely stuck when things change, like they always do.

Josh Berkus: See you in Chicago!

Just bought my plane tickets for Postgres Open in Chicago.  I'm really looking forward to this one, which will be even more user/application-developer-oriented than the first Postgres Open.  We have a keynote by Jacob Kaplan-Moss, co-creator of Django, and talks by staff from Heroku, Engineyard, Evergreen, and Paul Ramsey of PostGIS, as well as some of the usual suspects.  Register and buy a plane ticket now!  There's still time!

I'll be presenting an updated version of my PostgreSQL for data warehousing talk, Super Jumbo Deluxe.  My coworker Christophe will do PostgreSQL When It's Not Your Job.

Chicago is a great city to visit, too, and September is a good time to be there weather-wise.   It's generally sunny and pleasant but not too warm, and the flying spiders are gone.  There's tons of museums and world-renowned restaurants.  In fact, I'm bringing Kris this year.

Oh, and there's still Sponsorship slots open, hint, hint.

Chris Travers: A Software Architect's View of the Design of Double Entry Paper Accounting Systems

Taking a brief break from the computer side for now, I figured it would be worth describing the basic design considerations of double entry.  This post is the result of my work studying history and anthropology far more than working on LedgerSMB but of course working on the software played a role too.  Most of what is presented here is my own original research.

Also I am sure that accounting students or those who have studied accounting in school will find some aspects of this challenging since I explore some areas that accountants are not taught about and because, as a history buff, I refuse to believe that certain designations are arbitrary.

I think that the apparent evolution of double entry accounting shows a number of potential approaches for dealing with the very difficult distributed transaction issues of today, not so much in what to do but where to look for answers.

The Tally Stick System and the Origins of Double Entry

In medieval Europe there were two basic economic constraints.  First, most people could not read or write or perform arithmetic on paper, which limited the forms of exchange possible.  Secondly, there was a perpetual shortage of currency, which meant it was not really possible to use official money as a simple medium of exchange.  These problems were resolved by the development of the split tally stick.  The split tally stick evolved through the Middle Ages into a very highly developed system and was still in active use in some places into the 19th century.  Ironically, it was an effort in 1834 to destroy the sticks held by the British Government which burned down both houses of Parliament.

In its fully developed form, a split tally stick consisted of a stick, usually of hazel, which had notches carved in it.  The stick would be split lengthwise and one side would be cut short.  The long side would be called the stock (trunk) and represented the debt.  It would be held by the creditor.  The short side, called the foil ('leaf'), would follow the loan.

The creditor, at the allotted day, could show up and present his stock and demand payment.  If the creditor tried to pad the debt by adding more notches, this would be immediately apparent, and there was no way to erase notches on the other side.  Additionally it was immediately apparent if the sides were from different tally sticks.

As the Middle Ages progressed, literacy became somewhat more widespread.  From the days of the Frankish ruler Charles Martel, through the "Carolingian Renaissance", into the high Middle Ages, literacy began to spread, beginning with kings and eventually becoming available to the children of wealthy merchants.  With literacy came basic knowledge of arithmetic and eventually algebra, and the efforts to track tally sticks on paper may have given rise to double entry systems of bookkeeping.

The first step of course is a journal.  Here the tally sticks could be recorded.  The bigger stocks (like bigger numbers) would be recorded on the left, while the smaller foils would be recorded on the right.  One could then get a quick breakdown of one's position in terms of debt collectable vs debt owing.  Within a few hundred years, this developed into general ledgers and the full double entry accounting system, which has not changed in its outlines since Luca Pacioli wrote about it in the 15th century.

However it is worth noting that split tally sticks are inherently double entry in the sense that every debit has a corresponding credit, and they are inherently accrual since income and formal payment could not be as well correlated as could invoice and income.  Moreover the rules for stock and foil appear to have tracked the rules for debit and credit today.

It is finally worth noting that even in Pacioli's day, and even a hundred years later, tally sticks were in common use in this way.  Therefore, it seems to me that Pacioli himself was probably familiar with these and so his use of Latin terms corresponds almost certainly to the tally stick approach.


Business as Owning Nothing for Itself

Many of the principles I have found while investigating the origins of double entry accounting are challenging to us today and force us to think about businesses differently.  Exploring the development of designs of older functional systems also better prepares us to think about design ourselves in areas of software where we are building complex functional systems today.

In a double entry accounting system, a business owns nothing for itself.  Everything it owns, it owes to its owners.  This is why the books always balance and why debits always equal credits.  Corporations are a legal fiction which post-date the development of double-entry accounting systems but even they don't own things for themselves.  Their equity is owned entirely by their owners.  A corporation owns things only on behalf of its shareholders.  Limited liability only means that the effective equity cannot drop below 0 and a corporation has no possessory interest distinct from those of its stockholders.

This understanding is also behind the fundamental accounting equation that assets - liabilities = equity.

As long as the books are guaranteed to balance, then it is possible to detect errors by balancing the books (using a trial balance, as Pacioli suggests).  This is only possible if the intrinsic net worth of the business is always 0 (when we talk about the net worth of the business, we are talking about the extrinsic net worth, namely the equity balance, see below, but this is actually the net balance of debt owed by the business to its owners--- while an individual may on balance be worth a million dollars, this means something very different than if a business is worth a million dollars because the business, unlike the individual, owes that money to someone).

While stock and foil tally sticks were originally used to track debt, Henry I of England required, around 1100, that they also be used as receipts for taxes.  The process followed the pre-existing use, where the stock follows the person who has given money, while the foil was retained by the exchequer as a receipt of the money received.  This continued use suggests that the basic principles of double entry accounting were already known to some extent.  If tax liability is a debt owed, then the payment is a debit and one receives a stock, while the foil follows the money given.

It is my belief that the basic principles of double entry accounting were explored first with tally sticks and later, as literacy became somewhat more widespread, on paper.

What Accounting Systems are Designed to Do

Accounting systems are designed to do one thing only, to track who owes what to whom.  Even tracking of current assets is a part of that since all assets of a business are effectively owed to the owners.

This may sound overly simplistic but the anthropological assessment on the origin of money is that money itself arises as a way to quantify debt.  Debt pre-exists money systems, and all economies are powered by debt, and this is particularly noticeable in gift economies where debt, in the form of honor, is the primary currency.    Everywhere debt precedes payment, and everywhere debt precedes money.  Therefore not only is accrual-basis accounting better in terms of reporting, but it better represents what actually is happening economically.

Debits and Credits

In the 15th century, Luca Pacioli wrote his now-famous book on arithmetic, including a section on double-entry accounting.  Pacioli did not invent the system.  Instead, his description is that of the Venetian system which was already in use at the time.  In general, the design is indicative of accrual-based accounting, in part because assets include debts owed to the business.  This becomes clearer when looking at the Latin terms Pacioli uses in his descriptions and how these fit together.  Namely, he uses the terms "debit" and "credit" to refer to financial units in a way which is clearer in Latin than in English, and much of our accounting terminology derives from his work.

Pacioli's use of debit and credit refers to specifically distinct concepts, and when every accounting student is taught that these terms are arbitrary, every accounting student is taught wrong.

In Latin, debit refers to something which is owed and indeed it leads to our modern English word debt (via Old French).  The word derives from early Latin roots meaning "to take away something you have" and so it denotes a loss of a possessory interest in something.

A credit is the opposite side of a debit.  A debit is something owed.  A credit (from credere, to believe or trust, related to "creed" in Modern English) is something entrusted to someone else or loaned to them.  This term denotes a continued possessory interest in something, even as it is entrusted to someone else.  So we can most simply translate debit and credit as debt and investment or loan, respectively.  Being opposites, a credit abates a debit and vice versa.

Also being opposite sides of the transaction every transaction inherently balances.  One person's debit must by nature correspond to someone else's credit.  By entering everything from the perspective of the counterparty, whether customer, vendor, or owner, the books will be guaranteed to balance, and this allows one to detect errors because the total intrinsic value of the business will remain 0.

For this reason both debits and credits have two distinct meanings.  One is to off-set the other (paying back a loan is a debit against the corresponding credit), and the other is to represent debts (debits) and assets entrusted to the business but possessed by others (credits).  Every type of account furthermore has a specific type of counter-party but these fall into two categories:  owners (equity, and change-in-equity accounts, namely accounts for tracking income and expenses) and non-owners (assets and liabilities).

From this point we can derive the normal balance of every account because we understand the basic structure and functions of the system:

  • Asset accounts track debt and debt payments by those who owe money to the business.  Therefore they use debits for positive balances.  Being debits, they can be used to pay off loans made to the business (credits) or investors (investments are credits).  The perspective is that of the debtor.
  • Liability accounts track loans and loan payments to those the business owes money to.  Therefore they use credits for positive balances.  The perspective is that of the lender.
  • Equity accounts track investments in a business which reflect the value of the business to the owners.  Therefore they use credits for positive balances.  The perspective is that of the owner.
  • Income accounts track positive changes in equity.  Therefore they use credits for positive balances.  The perspective is that of the owner.
  • Expense accounts track negative changes in equity.   Therefore they use debits for positive balances.  The perspective is that of the owner.

What Software Engineers can Learn from Accounting Systems

Most of the time, those of us who design and write software find that our approaches are relatively orderly but fragile.  Aspects of a system fail to support each other, and we basically add complexity to the system in order to hold it together.  Highly engineered systems are thus brittle, or to the extent that they are not, require layer upon layer of complexity in order to keep working.  These systems often cannot continue to be maintained properly once people have forgotten why certain design decisions were made in the first place.

In contrast, double entry accounting systems (particularly accrual-based systems), which probably arose organically during the Late Middle Ages until they became important enough for Pacioli to write about, are highly evolved systems.  The overall approach, while difficult to grasp at first, can be maintained without any understanding of the reasons behind the design decisions.  Accounting systems have further evolved since Pacioli's day, presumably through the same process by which they evolved before then.  People understand, in general terms, the principles required to get meaningful information out of the system, are confronted by new problems, and respond by experimenting and sharing results, until new approaches take root.

One thing we as engineers can do is look to some of these highly evolved systems, and how they change over time, and recognize that they show us what may be a better way to create highly robust systems that tolerate and detect errors well.  In double entry accounting, we look to the owners' interest in the business vs the business's interest in everyone else's assets and make sure they are equal.  This adds redundancy in entering information, but it also adds richness in reporting that is not possible otherwise.  Perhaps there are opportunities for things like this elsewhere.

Here, with double entry accounting, the perspective chosen, namely the value of a business to itself, is an easy one to check.  Such a business will inherently have no value to itself.  Choice of perspective makes some difficult problems (like catching data entry errors when tracking money) relatively simple to catch, locate, and correct.   Sometimes the most elegant solutions are not in what you do but how you look at things.  This is very clear when it comes to this specific sort of system.

The real challenge going forward, in my view, is how we look at distributed systems and what perspectives we can find which make these problems elegantly solvable.  As with double entry accounting, this will probably have to arise from looking at pre-computing methods, for example petty cash management as a basis for a guarantee of eventual consistency in a distributed transaction.  Current approaches like those adopted by the NoSQL community don't get us there.  Approaches that draw from our experience doing non-distributed transactions well, as well as from paper forms of distributed transactions, may, however, provide something a lot more robust and usable.

Leo Hsu and Regina Obe: PLV8JS and PLCoffee Part 2B: PHP JQuery App


In our last article, PL/V8JS and PL/Coffee JSON search requests, we demonstrated how to create a PostgreSQL PL/JavaScript stored function that takes a json-wrapped search request as input. We generated the search request with PostgreSQL. As mentioned, in practice the json search request would be generated by a client-side JavaScript API such as JQuery. This time we'll put our stored function to use in a real web app built using PHP and JQuery. The PHP part is fairly minimalistic, just involving a call to the database and returning a single row back. Normally we use a database abstraction layer such as ADODB or PearDB, but this is so simple that we are just going to use the raw PHP PostgreSQL connection library directly. This example requires PHP 5.1+ since it uses the pg_query_params function introduced in PHP 5.1. Most of the work is happening in the JQuery client-side tier, and the database part we already saw. That said, the PHP part is fairly trivial to swap out with something like ASP.NET or most other server-side web languages.


Continue reading "PLV8JS and PLCoffee Part 2B: PHP JQuery App"

Josh Berkus: MySQL-to-PostgreSQL Migration Data from The451.com

As you know, due to PostgreSQL's wide redistribution, worldwide user base, and liberal licensing policies (i.e. no registration required), hard data on PostgreSQL adoption is somewhat hard to come by.  That's why I'm very grateful to The451.com, an open-source-friendly analyst service, for sharing with me the PostgreSQL-relevant contents of their report, MySQL vs. NoSQL and NewSQL: 2011-2015.

The451.com is no stranger to open source databases; their analytics services are built using PostgreSQL, MySQL and probably others.  So their analysis is a bit more on-target than other analysts (not that I would be thinking of anyone in particular) who are still treating open source databases as a fringe alternative.  Matt Aslett is their database and information storage technology guru.

The part of their report I found the most interesting was this part:
In fact, despite significant interest in NoSQL and NewSQL products, almost as many MySQL users had deployed PostgreSQL as a direct replacement for MySQL (17.6%) than all of the NoSQL and NewSQL databases combined (20%).
I'd some idea that we were seeing a lot of post-Oracle-acquisition MySQL refugees in the PostgreSQL community.  However, given the strong appeal that the new databases have for web developers, I'd assumed that three times as many developers were going to non-relational solutions as were to PostgreSQL.  I'm pleasantly surprised to see these figures from the451.com.

Of course, this means that us PostgreSQL folks really need to work even harder on being welcoming to former MySQL users.  They're our users now, and if they have funny ideas about SQL syntax, we need to be encouraging and helpful rather than critical.  And, above all, not bash MySQL.

The summary report also has the first solid figure I've seen on PostgreSQL market share in a while (hint: it's higher than 10%).   You can access it here.  To access the full base report, you need to apply for a full free trial membership.  As a warning, that free trial registration feeds into their sales reps, so don't be surprised if you get a call later.

Selena Deckelmann: LA Postgres first meeting is on for Tuesday, Aug 28!

The meeting is scheduled for Tuesday, August 28, at 7:30pm at 701 Santa Monica Blvd Suite 310, Santa Monica, CA. From the latest posting on the Meetup group: Beer and Stories We huffed and we puffed and now we got … Continue reading

Leo Hsu and Regina Obe: Schemas vs. Schemaless structures and The PostgreSQL Type Farm


There has been a lot of talk lately about the schemaless models touted by NoSQL groups and how PostgreSQL fits into this new world order. Is PostgreSQL Object-Relational? Is it Multi-Model? We tend to think of PostgreSQL as type liberal, and its liberalness gets more liberal with each new release. PostgreSQL is fundamentally relational, but has little bias about what data types go into the columns of related tables. One of PostgreSQL's great strengths is the ease with which different types can coexist in the same table, and the flexible index plumbing and plan optimizer it provides allow each type, regardless of how weird, to take full advantage of various index strategies and bindings.

Our 3 favorite custom non-built-in types we use in our workflow are PostGIS (of course), LTree (hierarchical type), and HStore (key-value type). In some cases, we may use all 3 in the same database and sometimes the same table - where we use PostGIS for spatial location, LTree for logical location, and HStore just to keep track of random facts about an object that are easier to access but too random to columnize. Sometimes we are guilty of using xml as well, when we haven't figured out what schema model best fits a piece of data and hstore is too flat a type to work.

The advent of JSON in PostgreSQL 9.2 does provide for a nested schemaless model similar to what the XML type offers, but more JavaScript friendly. I personally see JSON as more of a useful transport type than one I'd build my business around, or a type you'd use when you haven't figured out what structure, if any, is most suitable for your data. When you have no clue what structure a piece of data should be stored in, you should let the data tell you what structure it wants to be stored in; only then will you discover, by storing it in a somewhat liberal fashion, how best to retrofit it in a more structural, self-descriptive manner. Schemas are great because they are self-describing, but they are not great when your data does not want to sit in a self-described bucket. You may find in the end that some data is just wild and refuses to stay between the lines, and then by all means stuff it in xml or json or create a whole new type for it.
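
As a small illustration of that mix (a sketch of mine, assuming the postgis, ltree and hstore extensions are installed; the table and column names are invented):

CREATE TABLE assets
(
  id       serial PRIMARY KEY,
  geom     geometry(Point, 4326),  -- spatial location (PostGIS 2.0 typmod syntax)
  org_path ltree,                  -- logical location in a hierarchy (LTree)
  facts    hstore                  -- loose key-value facts (HStore)
);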

Chris Travers: ACID as a basic building block of eventually consistent, distributed transactions

Anything worth doing is worth doing well.  It therefore follows that anything worth tracking for a business is worth tracking well.  While this involves tradeoffs which are necessarily business decisions, consistency and accuracy of data are always important considerations.

In a previous post, I looked at the possible origin of double entry accounting in stock and foil split tally sticks.  Now let's look at how the financial systems of today might provide a different way of looking at distributed transactions in loosely coupled systems.  This approach recognizes that in human systems all knowledge is fundamentally local, and applies this to distributed computing environments, broadly defined.

Eventual Consistency as Financial Firewall

In the non-computer world there are very few examples of tightly coupled distributed transactions.  Dancing perhaps comes to mind, as does a string quartet playing a piece of music.  However, most important work is done using loosely coupled systems.  Loosely coupled systems done right are more robust and avoid some of the problems associated with the CAP theorem.  If the first violinist suffers a mishap part way through a quartet and is unable to continue, you probably will have to stop playing.  If the petty cash manager suddenly falls ill after you have withdrawn money from the petty cash drawer to go buy urgently needed office supplies, you can continue on your way.  You might have to wait until someone else can take over before you can give your receipt and change back, however.

Such systems however have two things in common:  first they are always locally consistent and this is extremely important.  The petty cash drawer and all cash vouchers are together in a consistent state.  Secondly counterparties are all locally consistent and transactions can be tied clearly back to such counterparties.  The party and counterparty together provide a basis for long-term conflict resolution in the form of an audit.  Eventual consistency is a property of the global system, not of any local component.

Moreover, there are times when global eventual consistency is actually desirable, but this business need is premised on absolute local consistency.  For example, if I am processing credit cards, the processing systems will be locally consistent and my accounting systems will be locally consistent, but these will not always be in sync.  My accounting department will have control over the data coming into the books.  Similarly, if my inventory is stored and shipped by a third party, they may send me a report every day, but it is not going to hit the books until it is reviewed by a person.  Both of these approaches use global eventual consistency as a firewall against bad financial data entering the books.  In other words, this is a control point where humans can review and correct the problems.  In addition to the performance issues, this is a major reason why you will rarely see two phase commit used to synchronize transactions between financial accounting systems.  Instead the data will be moved in ways which are not globally consistent but can only become so after human review and approval.  The computer, in essence, is treated like a person who can make mistakes or worse.

If this is the case internally it is even more the case between businesses.  If I run a bank, I am not giving your bank direct access to my financial database, and I am not going to touch your database.  The need for eventual consistency in transferring money between our banks will have to take the form of messages exchanged, human review, and more.

Why SQL?  Why ACID?  Why not a NoSQL ERP?

I have often said that NoSQL is an extraordinarily poor choice for ERP software.  In addition to the difficulties in doing ad hoc reporting, you have a need for absolute, local consistency, which is not a goal of NoSQL software.  Without local consistency, you can't audit your books, and you cannot determine where something went wrong.  ACID is not a global property of the business IT infrastructure, and it shouldn't be (if you try to make it one, you run into a dreaded brick wall called "the CAP theorem").  It is a property of local data stores, and all your internal controls are based on the assumption that data is locally consistent and knowledge is local.

Example 1: An Eventually Consistent Cash Register

The first example of how we might look at this might be an eventually consistent retail environment.  In this environment we aren't taking materials out of the store to ship, or if we do we are going down and pulling them off the shelf before entry.  The inventory on the shelf is thus authoritative here.  The books are also reviewed every day and transactions reviewed/posted.

Since nobody can pull a product off the shelf that doesn't exist, we don't have to worry about real-time stock tracking.  If we did though there would be ways to handle this.  See example 2 below.

In this example the cash register would locally be running an ACID-compliant database engine and store the data through the day locally. At the end of the day it would export the transactions it did to the accounting system where the batch would be reviewed by a person before posting to the books.  Both the cash register and the accounting system would retain records of the transfer and synchronization, making an audit possible.  Because of the need for disconnected operation I am considering building such a cash register for LedgerSMB 1.5

Example 2:  Real-time inventory tracking for said cash register.

So business needs change, and now the disconnected cash register needs to report inventory changes on as real-time basis as possible to the main ERP system.  This is made easier if the cash register is running PostgreSQL

So we add a trigger to the table that stores the inventory movements which queues these for processing.  A trigger on the queue table issues a NOTIFY to another program which attempts to contact the ERP system.  It sends info on the inventory plus invoice number.  This is digitally signed with a key for the cash register.  The ERP stores this information and uses it for reporting and reconciliation at the end of the day, but treats it as a signed voucher checking out inventory.  At the end of the day these are reconciled and errors flagged.  If the message exchange fails, it tries again later.
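
A minimal sketch of that queue-and-notify plumbing might look like the following (all table, function and channel names are invented for illustration; the listening program would LISTEN on the channel, do the signed send, and mark rows as sent once the ERP acknowledges them):

CREATE TABLE inventory_sync_queue (
    id          bigserial PRIMARY KEY,
    movement_id bigint NOT NULL,
    queued_at   timestamptz NOT NULL DEFAULT now(),
    sent        boolean NOT NULL DEFAULT false
);

-- Trigger on the movements table: queue each movement for the ERP.
CREATE OR REPLACE FUNCTION queue_inventory_movement() RETURNS trigger AS $$
BEGIN
    INSERT INTO inventory_sync_queue (movement_id) VALUES (NEW.id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER inventory_movements_queue
    AFTER INSERT ON inventory_movements
    FOR EACH ROW EXECUTE PROCEDURE queue_inventory_movement();

-- Trigger on the queue table: wake up the messaging program that talks to the ERP.
CREATE OR REPLACE FUNCTION notify_inventory_sync() RETURNS trigger AS $$
BEGIN
    NOTIFY inventory_sync;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER inventory_sync_notify
    AFTER INSERT ON inventory_sync_queue
    FOR EACH ROW EXECUTE PROCEDURE notify_inventory_sync();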

Now, here you have three locally consistent systems: the POS, the ERP, and the messaging module.   Eventual consistency is a global property arising from it, and is preferable, business-wise, to absolute consistency because it gives an opportunity for human oversight.

These systems draw their inspiration from the paper accounting world.  They are partition-tolerant, available, and guaranteed, absent hardware destruction, to be eventually consistent.  Not only does this get around the problems flagged in the CAP theorem but they also provide opportunities for humans to be in control of the business.

Unlike BASE systems, these systems, built on ACID provide a basic framework where business process controls are consistently enforced.

Considerations

For such a system to work a few very specific requirements must be met (ACID allows us to meet them but we still have to design them into a system):
  • Messages must be durable on the receiving side.
  • Messages must be reproducible on the sending side.
  • Each side must be absolutely consistent at all times.
  • Sender and receiver do not require knowledge of the operations handled by the other side, just of the syntax of the messages.  The sender must have knowledge that the message was received, but need not have knowledge that it was durably stored (such knowledge can be helpful but it is not required).
  • Humans, not machines, must supervise the overall operation of the system.


Summary and thoughts on the CAP theorem

 In the off-line world, the CAP theorem not only enforces real limits in coordinated activities, but it also provides solutions.  Individuals are entrusted with autonomy, but controls are generally put in place to ensure both local consistency and the ability to guarantee eventual consistency down the road.  Rather than building on the BASE type approach, these build on the fundamental requirements of ACID compliance.   Paper forms are self-contained, and are atomic for practical reasons.  They maintain consistent state.  The actor's snapshots of information are inherently limited regarding what information they receive about the state on the other side, providing something akin to isolation.  And finally the whole point of a paper trail requires some degree of durability.  Thus the BASE approach resembles the global system but all components are in fact ACID-compliant.

This provides a different way to think of distributed computing, namely that functional partitions are sometimes desirable and can often be made useful, thus allowing one to move from the CAP theorem as a hard limit to viewing the CAP theorem as a useful reminder of what is inherently true, that loosely coupled systems create more robust global environments than tightly coupled ones, and that people rather than computers are often necessary for conflict resolution and business process enforcement.  The basic promise is that global consistency will be eventually maintained, and that the system will continue to offer basic services despite partitions that form.  This is the sort of thing the BASE/NoSQL proponents are keen on offering, but their solutions make it difficult to enforce business requirements because individual applications may see eventual consistency.  Here each application is absolutely consistent, but the entire environment is eventually consistent.  This matches off-line realities more closely.

Instead human systems, perhaps because they are not scalable in the CAP sense (having limited communications bandwidth), have developed very sophisticated systems of eventual consistency.  These systems require absolute consistency on the node (person) level and then coordination and reconciliation between systems.  In my accounting work the approach I have usually taken is to put the human in control but do whatever you can to simplify and streamline the human's workflow.  For example, when trying to reconcile a checking account it may be useful to get data from the bank.  This data is then matched to what's in the database using a best guess approach (in some cases we can match on check number but for wires and transfers we cannot and so we guess based on date and amount)  and the human is left to resolve the inevitable differences as well as review and approve. 

The nice thing about this model (and it differs from BASE considerably), is that you can expect absolute consistency at all times on every node, and availability of the global environment does not depend on every individual component functioning.  In the cash register example, the cash register can go down without the ERP going down and vice versa.  This means that you can achieve basic availability and eventual consistency without sacrificing ACID on a component level.  The key however is to be able to go back to the components and be able to generate a transaction history if needed, so if something failed to come through you can re-run it.

Because this is based on the ACID rather than the BASE model, I would therefore offer a cute name for this model of consistency: Locally Available and Consistent Transaction and Integrity Control, on top of ACID.  This can of course be referred to by the cute shorthand of LACTIC ACID, or better, "the LACTIC ACID model of eventual consistency."  It is a local knowledge model, rather than the more typical global knowledge model, and assumes that components, like people, have knowledge only of the things they need to know, that they are capable of functioning independently in disconnected mode, and that they are capable of generating consistent, and accurate, pictures of the business later on demand.

This approach further relegates traditional tools like two-phase commit to the role of tools, which can be very helpful in some environments (particularly replication for high availability), but are not needed to ensure consistency of the environment when the systems are loosely coupled.  They may still be helpful in order to handle some sorts of errors, but they are one tool among many.

Finally although financial systems are not transportation vehicles, the automation paradox applies there as well.  If humans rely on too much automation, they are inclined to trust it, and therefore when problems arise are ill-prepared to deal with them.  In this approach, humans are integral parts of the operation of the distributed computing environment and therefore are never out of the loop.  This may sound less than desirable but in my experience every case of embezzlement I have heard of could have been prevented by more human eyes on the books and less, rather than more, immediate consistency.  There is nothing more consistent than having a single employee with full access to the money.  Separation of duties implies less consistency between human actors but this is why it is useful.

This approach is intended to be evolutionary rather than revolutionary.  Rather than trying to create something new, it is an attempt to take existing mature processes and apply them in new ways.

