With a default configuration of PostgreSQL, the information in the PostgreSQL logs is generally pretty small: some warnings and the occasional typo in psql. However, once you start tweaking the configuration parameters and logging more things like query times, the logs get a lot bigger. Having PostgreSQL generate over 10GB of logs per day is not unheard of if you set log_min_duration_statement low enough on a busy server. You can analyze logs of that size with pgBadger pretty easily, but to get the real value, you want to look at the logs over the course of time: things like how the query load compares to the same time last week, last month, or last year. If you save the daily reports from pgBadger, you can look at them over time, but that's a manual process. Another option is to put your logs into Hadoop and use MapReduce to analyze the information over time.
The Hadoop ecosystem has a service built for exactly that purpose called Flume. It lets you set up a Flume agent that listens for the logs and then pushes them into Hadoop. Setting everything up is pretty simple.
First, you’ll want to set up PostgreSQL to send its logs to syslog. In postgresql.conf, you’ll set:
log_destination = 'syslog'
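If you want more context in what gets shipped, a few related settings are worth a look too. This is just a sketch; the ident string and the duration threshold below are example values, not anything you have to use:

# Example postgresql.conf additions: tag the syslog entries and log slow queries
syslog_facility = 'LOCAL0'          # facility PostgreSQL logs under (LOCAL0 is the default)
syslog_ident = 'postgres'           # tag used in the syslog messages
log_min_duration_statement = 250    # example threshold: log statements running longer than 250 ms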
Then you want to set up your syslog server to forward messages to Flume. Depending on your syslog server, you’ll add a rule like this to rsyslog.conf:
*.* @127.0.0.1:5140
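That rule forwards every syslog message to Flume, not just the PostgreSQL ones. If you only want the database logs, a sketch that assumes PostgreSQL is logging under its default LOCAL0 facility would be:

# Forward only the LOCAL0 facility (PostgreSQL's default syslog_facility) over UDP to Flume
local0.*    @127.0.0.1:5140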
Finally, you’ll need to start Flume with a configuration that listens for syslog messages and pushes them into an HDFS sink.
# pglogs.conf: A single-node Flume configuration

# Name the components on this agent
pglogs.sources = sl
pglogs.sinks = k1
pglogs.channels = c1

# Describe/configure the source
pglogs.sources.sl.type = syslogudp
pglogs.sources.sl.port = 5140
pglogs.sources.sl.host = 0.0.0.0

# Describe the sink
pglogs.sinks.k1.type = hdfs

# Use a channel which buffers events in memory
pglogs.channels.c1.type = memory
pglogs.channels.c1.capacity = 1000
pglogs.channels.c1.transactionCapacity = 100

# Describe the HDFS sink
pglogs.sinks.k1.hdfs.path = /user/postgres/logs
pglogs.sinks.k1.hdfs.filePrefix = pg
pglogs.sinks.k1.hdfs.fileSuffix = .log
pglogs.sinks.k1.hdfs.fileType = DataStream

# Bind the source and sink to the channel
pglogs.sources.sl.channels = c1
pglogs.sinks.k1.channel = c1
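With the configuration saved, the agent can be launched with the standard flume-ng command. A sketch, assuming the file lives at conf/pglogs.conf; the agent name passed to --name has to match the pglogs prefix used in the properties:

# Start the agent in the foreground, logging to the console
flume-ng agent --conf conf --conf-file conf/pglogs.conf --name pglogs -Dflume.root.logger=INFO,console

Once it's running, PostgreSQL log lines forwarded by rsyslog should start showing up as pg*.log files under /user/postgres/logs in HDFS, ready for whatever MapReduce analysis you want to run over them.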