PostgreSQL is becoming an increasingly popular choice as an embedded database because of its BSD license, relatively low memory footprint and rich feature set. A few people have asked me if Slony would be a good choice for replication in an embedded environment. Embedded deployments haven't been a primary use-case for Slony, and some of the challenges you would face are worth writing about.
Embedded deployments involve including PostgreSQL + Slony as part of some other system. Usually this comes up when you are shipping an appliance or a piece of software to customers and you need to include a database. The database is there to support your application; it isn't the centerpiece of what you're shipping. Embedded databases typically have to run for long periods of time without upgrades and need to be managed by the application they support; they don't usually have a DBA to maintain them. The defining aspect of an 'embedded deployment', for the purposes of this discussion, is that an administrator can't easily get access to the environment to tune things or troubleshoot when things go wrong.
Installation
Installing Slony in your embedded environment isn't going to be a big deal if you control the version of PostgreSQL that is being used. You should compile Slony against your target version of PostgreSQL and ship and install the various Slony files as you normally would.
Installations where you don't control the version of PostgreSQL that Slony is being used with will be much harder to deal with. Slony needs to compile its stored functions against a specific version of PostgreSQL. Slony stored functions compiled against one version of PostgreSQL *might* work against a different minor release of the same major version, but there is no guarantee. If you can't control the exact version of PostgreSQL then the safe path is to pre-compile and ship Slony stored functions for all of the PostgreSQL versions you will support, or to require enough build infrastructure (gcc, make, etc.) so that you can build Slony from source on the target machine.
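For the case where you do control the PostgreSQL build, the compile itself is simple. Here is a minimal sketch, assuming a Slony-I source tarball and a target PostgreSQL installed under /opt/pgsql; the version number and paths are placeholders for whatever your product actually ships:

```sh
# Placeholder version and paths: point --with-pgconfigdir at the pg_config
# of the PostgreSQL you ship, then package the resulting slon and slonik
# binaries plus the shared library and SQL files with your product.
tar xjf slony1-2.2.x.tar.bz2
cd slony1-2.2.x
./configure --with-pgconfigdir=/opt/pgsql/bin
make
make install
```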
Cluster Configuration
If your application deployment only involves one database then you don't need Slony; any interesting Slony cluster involves more than one database. Your application will need to come with tooling that manages the Slony installation and configuration across multiple machines. Slonik scripts need to have the hostname and port of the database servers in your cluster, and since hostnames tend to be machine specific you're going to have to dynamically generate at least parts of your slonik scripts.
Dynamically generating slonik scripts where the only difference between the generated scripts is the hostname is easy (see the sketch below), but dealing with things like a configurable number of nodes is harder. You need to come up with a model that describes the types of cluster configurations you want to support. A simple model would be a two-node cluster with one origin and one replica. A more complicated model that allows multiple replicas feeding from a single origin/provider shouldn't be too hard to implement. Situations where the number of origins or cascading providers is configurable will be much harder to model.
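As a rough illustration of the simplest case, a small shell script can write the slonik preamble from values the appliance already knows about itself. The cluster name, database name and user below are made-up placeholders:

```sh
#!/bin/sh
# Generate the slonik preamble for a fixed two-node model.
# $1 = origin hostname, $2 = replica hostname; everything else is a placeholder.
ORIGIN_HOST=$1
REPLICA_HOST=$2

cat > preamble.slonik <<EOF
cluster name = appcluster;
node 1 admin conninfo = 'dbname=appdb host=${ORIGIN_HOST} port=5432 user=slony';
node 2 admin conninfo = 'dbname=appdb host=${REPLICA_HOST} port=5432 user=slony';
EOF
```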
Your tooling will need to provision each node with your database schema and then call slonik with the appropriate scripts to set up replication. This shouldn't be too difficult; the problem will be dealing with any errors or failures during the process.
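To make that concrete, here is a sketch of the slonik commands such tooling might issue for the two-node model once the schema has been loaded on both databases. It reuses the preamble generated above, and the table names are placeholders for your application schema:

```sh
#!/bin/sh
# Hypothetical setup for the two-node model: node 1 is the origin, node 2 the
# replica. $1 = origin hostname, $2 = replica hostname.
slonik <<EOF
include <preamble.slonik>;

init cluster (id = 1, comment = 'origin');
store node (id = 2, comment = 'replica', event node = 1);
store path (server = 1, client = 2, conninfo = 'dbname=appdb host=$1 user=slony');
store path (server = 2, client = 1, conninfo = 'dbname=appdb host=$2 user=slony');

create set (id = 1, origin = 1, comment = 'application tables');
set add table (set id = 1, origin = 1, id = 1, fully qualified name = 'public.customers');
set add table (set id = 1, origin = 1, id = 2, fully qualified name = 'public.orders');

subscribe set (id = 1, provider = 1, receiver = 2, forward = no);
EOF
```

The tooling also has to start (and keep restarting) a slon process for each node, for example `slon appcluster 'dbname=appdb host=... user=slony'`, which is another job that would normally fall to an administrator.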
Error Handling
Setting up a Slony cluster in an automated fashion is pretty common; the Slony regression tests do automated setup and reconfiguration of Slony. Dealing with errors in a reliable and sane manner is probably going to be your biggest challenge in embedding Slony in another product, and I don't have a magic solution for it. Slony is a distributed system and there are a lot of things that can go wrong. The following are some of the situations you need to consider:
- One of your database servers (a replica) goes down. Do you leave the server in the cluster in case it comes back after a reboot, or do you remove it from the cluster? Keeping a node in the cluster means that data in sl_log_1 and sl_log_2 is kept around until that node comes back. Unbounded growth of the Slony log tables will eventually be a problem, but when it becomes a problem depends on your environment, transaction volume and hardware (see the sketch after this list for one way tooling might decide to give up on a dead replica).
- In a perfect world any data that gets accepted on the origin will replicate without issue onto the replica. Sometimes the world isn't perfect: differences between the origin and replica, bugs in Slony, or bugs in your application might mean that the data doesn't replicate. When Slony can't apply a SYNC it will pause for a bit and retry, and it will keep retrying until the SYNC works. The idea is that eventually an administrator will fix the underlying problem and replication can continue. In an embedded environment there isn't an administrator to do this.
- Slony is pretty good about providing users with the tools to deal with failures during normal operation; this is generally some combination of restarting slons, the FAILOVER command, and the DROP NODE command. Slony does not provide the same level of command support for dealing with failures that happen during cluster reconfiguration. If slonik (or the machine/network) crashes in the middle of a slonik command you *MIGHT* be able to re-issue the command, but you might not be. Many slonik commands only change the event node, and do so in a single transaction, but there are exceptions. Most slonik scripts aren't transactional even if all of the commands in them are, because a WAIT FOR EVENT can't be done inside a database transaction, which means you probably can't just rerun your script. An experienced Slony administrator can usually figure out which parts of a slonik script worked and which commands need to be resubmitted.
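For the first situation above (a replica that has gone away), one approach is for the application's tooling to watch replication lag on the origin and drop the node once the lag passes some threshold, accepting that the node will have to be rebuilt later. A rough sketch, reusing the cluster and node numbering from the earlier examples; the threshold is an arbitrary example value:

```sh
#!/bin/sh
# Check how far the replica (node 2) has fallen behind using the sl_status
# view in the cluster schema on the origin, and drop the node if the lag is
# excessive so sl_log_1/sl_log_2 can be trimmed. Names and the threshold
# are assumptions for the example.
LAG=$(psql -At -d appdb -c \
  "SELECT coalesce(max(st_lag_num_events), 0) FROM _appcluster.sl_status;")

if [ "$LAG" -gt 100000 ]; then
    slonik <<EOF
include <preamble.slonik>;
drop node (id = 2, event node = 1);
EOF
fi
```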
The issues you're going to face embedding Slony in your application are not that different from the issues you would face with any other hands-off distributed system. If you go this route you need to remember that you are no longer in a common Slony use-case, but with some planning and testing of different scenarios you should be able to come up with ways of addressing these challenges.
