Working in the background for fun and profit
The background worker facility is one of the cool, open-ended features in PostgreSQL 9.3, with many possibilities. This post shares some thoughts on interesting potential use cases for the facility, along with implementation details from developing a prototype background worker: ALPS.
What are good tasks for BG workers?
The background worker feature allows you to register a program with postmaster. The registered process runs as a child of postmaster, and their lifecycles are closely tied. It makes sense, then, that many of the default background processes are closely related to the operation of the database itself. In other words:
bg workers are good for programs you want running whenever your database is running
Consider autovacuum, bgwriter, checkpointer, and so on: most, or perhaps all, of the built-in background processes also deal with spreading out the load caused by resource-intensive tasks. By moving these tasks out of normal backend operation, load can be spread with more flexibility.
background workers are good for tasks that spread resource-intensive work out over time.
Of course, there are surely other motivations for using background workers. A background worker would be perfect for channeling half cents from numeric columns into a secret account, as in Superman III.
That scheme never would have worked if done all at once.
Prototype BG worker: ALPS
ALPS (Automatic Linear Prototyping System) is a background worker which builds prototype shotgun linear models to predict numeric and boolean columns from potential support columns in the same table. It uses the linear and logistic regression functions from the MADlib extension.
In other words, it’s a process which starts when postmaster starts, stops when postmaster stops, polls table analyze information (similar to autovacuum), and generates fields which are linear model predictions for every numeric and boolean column in a database. This seems to fit the background worker model pretty well.
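To make that concrete, here is a rough sketch of what checking for freshly analyzed tables could look like from inside the worker, using the Server Programming Interface (SPI, covered below). The helper name and query are illustrative, not ALPS's actual code:

/* Hypothetical helper: has any user table been (auto)analyzed since
 * the last check?  Assumes an SPI connection is already set up and
 * that last_check is an ISO timestamp string kept by the caller. */
static bool
tables_recently_analyzed(const char *last_check)
{
    StringInfoData buf;
    bool        found;

    initStringInfo(&buf);
    appendStringInfo(&buf,
                     "SELECT 1 FROM pg_stat_user_tables"
                     " WHERE greatest(last_analyze, last_autoanalyze) > '%s'",
                     last_check);
    SPI_execute(buf.data, true, 1);
    found = (SPI_processed > 0);
    pfree(buf.data);
    return found;
}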
To demo ALPS, we first create a database with some data well suited to the demonstration: tables from R's data packages, dumped to SQL.
wget http://no0p.github.io/images/rdata.sql
createdb demo
psql -d demo < rdata.sql
We'll be looking at the mtcars data, a table of facts about cars: horsepower, number of cylinders, number of forward gears, gas mileage, transmission type, and so on.
# \d rdata.mtcars
Table "rdata.mtcars"
Column | Type | Modifiers
--------+---------+-----------------------------------------------------------
id | integer | not null default nextval('rdata.mtcars_id_seq'::regclass)
mpg | numeric |
cyl | numeric |
disp | numeric |
hp | numeric |
drat | numeric |
wt | numeric |
qsec | numeric |
vs | numeric |
am | boolean |
gear | numeric |
carb | numeric |
Indexes:
"mtcars_pkey" PRIMARY KEY, btree (id)
Assuming we have installed ALPS, the schema for this table will be modified to include additional columns, named after the originals with a __predicted suffix. A linear model is trained for each supported column and used to prepopulate the predicted values. The table is transformed into:
# \d rdata.mtcars
Table "rdata.mtcars"
Column | Type | Modifiers
-----------------+---------+-----------------------------------------------------------
id | integer | not null default nextval('rdata.mtcars_id_seq'::regclass)
mpg | numeric |
cyl | numeric |
disp | numeric |
hp | numeric |
drat | numeric |
wt | numeric |
qsec | numeric |
vs | numeric |
am | numeric |
gear | numeric |
carb | numeric |
am__predicted | numeric |
carb__predicted | numeric |
cyl__predicted | numeric |
disp__predicted | numeric |
drat__predicted | numeric |
gear__predicted | numeric |
hp__predicted | numeric |
Indexes:
"mtcars_pkey" PRIMARY KEY, btree (id)
Prepopulated predictions are of course not particularly useful for this data, as there are no nulls. However, if we had large gaps in the column am (a boolean for automatic transmission), we could now easily get predictions as follows.
select coalesce(am, am__predicted) from rdata.mtcars;
Let's consider how this might be useful by trying to figure out the transmission type of my two primary combustion vehicles.
Let's start by taking a look at our odds ratios. ALPS stores models in an alps schema, with models scoped by schema, table, and column.
select odds_ratios from alps.rdata_mtcars_am_logit;
carb: | 6.511 |
cyl: | 2.596 |
disp: | 0.794 |
drat: | 433832.055 |
gear: | 2 |
hp: | 1.214 |
mpg: | 0.694 |
qsec: | 0.011 |
vs: | 1.650 |
wt: | 61.308 |
OK, so maybe a normalization feature will be important, but this is just a concept prototype. We'll simply look up the vehicle values for mpg, number of cylinders, weight, and hp, and zero out the rest.
insert into rdata.mtcars (mpg, cyl, hp, wt, carb, disp, drat, gear, qsec, vs)
values (0.3, 8, 1580, 38000, 0, 0, 0, 0, 0, 0), (50, 3, 61, 1852, 0, 0, 0, 0, 0, 0);
analyze rdata.mtcars;
-- analyze triggers re-evaluation by alps (we could also wait for autovacuum)
And now we can get a shotgun estimate of the transmission types of the vehicles.
select am__predicted from rdata.mtcars where id > 31;
am__predicted
---------------
t
t
(2 rows)
It turns out both vehicles do have an automatic transmission, but this is just an illustrative example.
Implementing a background worker
It turns out that implementing a background worker is quite straightforward, even for beginners.
To get started, take a look at the worker_spi module which ships with contrib. It offers a complete implementation of a simple background worker and demonstrates how to set up and manage the context for interacting with the Server Programming Interface (SPI). SPI is a convenient way to execute arbitrary SQL from your program.
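For reference, the basic SPI call pattern from a background worker looks roughly like this (modeled on worker_spi; the database name is an example and error handling is omitted):

/* Once at startup: attach the worker to a database. */
BackgroundWorkerInitializeConnection("demo", NULL);

/* For each unit of work: run SQL inside a transaction. */
StartTransactionCommand();
SPI_connect();
PushActiveSnapshot(GetTransactionSnapshot());

SPI_execute("SELECT count(*) FROM pg_class", true, 0);

SPI_finish();
PopActiveSnapshot();
CommitTransactionCommand();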
In the case of ALPS, the program relies only on the SQL interface to the MADlib extension, so training models is as simple as generating the appropriate SQL in C. Some of the key components of the program are illustrated below.
Registering the worker
/* in _PG_init(); setup referenced from worker_spi, GUC details omitted */
BackgroundWorker worker;
worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
worker.bgw_restart_time = BGW_NEVER_RESTART;
worker.bgw_name = "alps";   /* a char * in the 9.3 API */
worker.bgw_main = alps_main;
RegisterBackgroundWorker(&worker);
Well, that's the gist of it.
Polling
while (!got_sigterm)
{
    int rc;

    /* Sleep until the latch is set, the timeout expires, or postmaster dies. */
    rc = WaitLatch(&MyProc->procLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                   alps_poll_seconds * 1000L);
    ResetLatch(&MyProc->procLatch);

    /* Emergency bailout if postmaster has died. */
    if (rc & WL_POSTMASTER_DEATH)
        proc_exit(1);

    process_columns();
}
a pretty simple polling loop
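The got_sigterm flag implies a SIGTERM handler; following the worker_spi pattern it would look something like this (a sketch, with the handler name assumed):

static volatile sig_atomic_t got_sigterm = false;

/* Installed from alps_main with pqsignal(SIGTERM, alps_sigterm),
 * followed by BackgroundWorkerUnblockSignals(). */
static void
alps_sigterm(SIGNAL_ARGS)
{
    int save_errno = errno;

    got_sigterm = true;
    if (MyProc)
        SetLatch(&MyProc->procLatch);
    errno = save_errno;
}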
Executing SQL to train a MADlib model
void
train_logit_model(char *schemaname, char *tablename, char *colname,
                  char *coltype, char *support)
{
    StringInfoData buf;

    /* Train model */
    elog(LOG, "logistic model %s.%s %s", schemaname, tablename, colname);
    initStringInfo(&buf);

    /* Just a proof of concept -- wouldn't recommend running this in production! */
    appendStringInfo(&buf,
                     "DROP TABLE IF EXISTS alps.\"%s_%s_%s_logit\";"
                     " DO $$ BEGIN"
                     " PERFORM madlib.logregr_train("
                     " '%s.%s', 'alps.\"%s_%s_%s_logit\"', '\"%s\"', 'ARRAY[1,%s]'"
                     " , NULL, 20, 'irls');"
                     " EXCEPTION when others then END $$;",
                     schemaname, tablename, colname,
                     schemaname, tablename, schemaname, tablename, colname,
                     colname, support);
    elog(LOG, "%s", buf.data);

    SPI_execute(buf.data, false, 0);
}
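Once a model table exists, filling the __predicted column is just another generated statement. Assuming the installed MADlib provides madlib.logregr_predict, it could look something like this; again a sketch, not ALPS's exact SQL:

/* Sketch: apply the trained coefficients to populate predictions.
 * Identifiers are illustrative, as above. */
initStringInfo(&buf);
appendStringInfo(&buf,
                 "UPDATE %s.%s SET \"%s__predicted\" ="
                 " madlib.logregr_predict(m.coef, ARRAY[1,%s])"
                 " FROM alps.\"%s_%s_%s_logit\" m",
                 schemaname, tablename, colname, support,
                 schemaname, tablename, colname);
SPI_execute(buf.data, false, 0);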
The source code for ALPS is available online for review and use.
Comments
There are quite a few ways this application could be improved. Adding more models, looking for supports in joined relations via keys, updating predictive join tables rather than adding columns, better heuristics for spreading load, and online algorithms are a few that come to mind.
I’m excited to see what other kinds of background workers may be invented. Thoughts and discussion welcome on #postgresql on freenode.