On several occasions, I’ve been talking with groups of PostgreSQL users when the question comes up, “If I use PostgreSQL, why would I want to use Hadoop?” There are many answers, and the question usually comes from people who don’t really know the details of Hadoop, but let’s just focus on a single use case: backups.
For most larger databases, online backups are used with point-in-time recovery. This lets administrators back up, or more importantly restore, their databases quickly. The trade-off is that you’re making a physical copy of the database files, so if your database is a terabyte, your backup will be a terabyte before you compress it. If you’re keeping weekly backups and you have a company policy to retain backups for months, it’ll require a lot of storage. That’s where Hadoop comes in.
At the core of Hadoop is the Hadoop Distributed File System (HDFS), which isn’t a POSIX-compliant file system, but it does have some pretty great properties. It’s designed to run on inexpensive hardware while still being fault tolerant. This means that you can go out and buy some inexpensive drives, put them in some older desktops you have lying around the office, and you’ll have a highly redundant storage cluster. No need to buy an expensive SAN or NAS device or ship your data to a cloud service like Amazon S3.
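You can see that redundancy for yourself with the Hadoop client. As a sketch (the URL, path, and replication factor here are assumptions, not from any particular cluster):

```shell
# Show cluster capacity and the state of each datanode
hadoop dfsadmin -report

# Raise the replication factor to 3 for a backups directory;
# -w waits until replication actually completes
hadoop dfs -setrep -w 3 hdfs://192.168.122.91:9000/user/postgres/backups
```

A higher replication factor costs more disk but means more of those old desktops can die before your backups are at risk.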
Leveraging HDFS for your PostgreSQL backups is pretty straightforward. Assuming you have a Hadoop cluster already set up, you’ll just need to put the Hadoop client on your server. From your PostgreSQL server, first test that you can connect to the cluster and do a simple directory listing.
jim@jim-XPS:~$ hadoop dfs -ls hdfs://192.168.122.91:9000/user
Found 3 items
drwxr-xr-x   - bigsql supergroup          0 2013-06-26 12:15 /user/bigsql
drwxr-xr-x   - bigsql supergroup          0 2013-06-26 12:00 /user/hive
drwxr-xr-x   - bigsql supergroup          0 2013-07-08 12:17 /user/postgres
If you run into any errors, you’ll most likely need to change the fs.default.name property in core-site.xml on the client to use the correct URL instead of localhost.
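On the client, that property might look something like this (the host and port match the examples in this post; adjust them for your own cluster):

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.122.91:9000</value>
  </property>
</configuration>
```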
Once you have your connectivity configured correctly, you can leverage the cluster as a place to keep your backups. Create your base backup with the tool of your choice and once it’s done, just copy it to HDFS.
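For example, with pg_basebackup you could take a compressed, tar-format base backup in one step (a sketch; the connection details and output directory are assumptions):

```shell
# -Ft writes tar format, -z gzips it, -D is the output directory
pg_basebackup -h localhost -U postgres -Ft -z -D /tmp/basebackup_20130711
```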
hadoop dfs -copyFromLocal basebackup_20130711.tar.gz hdfs://192.168.122.91:9000/user/postgres/backups/base
You can even set your archive_command to write your WAL files directly to HDFS. Just be careful with this one if you’re switching log files frequently: the command takes a bit longer than a simple rsync.
archive_command = 'hadoop dfs -copyFromLocal %p hdfs://192.168.122.91:9000/user/postgres/backups/wal/%f'
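Restoring works the same way in reverse: your restore_command can pull WAL segments back out of HDFS with copyToLocal. A sketch, assuming the same paths as above:

```shell
restore_command = 'hadoop dfs -copyToLocal hdfs://192.168.122.91:9000/user/postgres/backups/wal/%f %p'
```

The same caveat about speed applies here; pulling each segment from the cluster will be slower than reading it from local disk, so expect recovery to take longer.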