So, another month, another PostgreSQL update. This one is a lot more critical than most because it patches up to three data loss issues, depending on what version of PostgreSQL you're on and which features you're using. Two kinds of users need to schedule an update downtime for this weekend, if at all possible:
- Users on version 9.3
- Users of binary replication
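If you're not sure whether either of those describes you, a couple of quick queries will tell you. This is just a sketch; run them via psql against each cluster:

SELECT version();                                   -- which release is this cluster running?
SELECT client_addr, state FROM pg_stat_replication; -- on the primary: any binary (streaming) replicas connected?
SELECT pg_is_in_recovery();                         -- on a suspected standby: returns true if it is replaying WAL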
Annoyingly, you'll have to do some additional stuff after you update:
- If using binary replication, you need to take a new base backup of each replica after updating (a sketch of one way to do that follows the VACUUM command below).
- You should run the following on each production database after updating:
VACUUM; -- optionally, ANALYZE as well
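For the base backup, pg_basebackup from the command line is the usual tool. If you'd rather drive it from SQL, the low-level backup API works too; the sketch below assumes a 9.x-era setup, and the label is arbitrary:

SELECT pg_start_backup('post-update-reseed');  -- run on the primary
-- ...copy the primary's data directory over to the replica outside of SQL
-- (rsync, tar, etc.), and re-create the replica's recovery.conf...
SELECT pg_stop_backup();                       -- run on the primary when the copy is done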
The VACUUM step is critical for users on 9.3, and a generally good idea for users on other versions. Note that, while VACUUM is a non-blocking operation, it can generate a lot of IO, so run it during a slow traffic period and monitor it for pile-ups.
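If you're worried about the IO hit, you can throttle the VACUUM for that session and watch for queries stacking up behind it from another connection. A rough sketch (the delay value and the monitoring query are just examples; the pg_stat_activity column names are the 9.2-9.5 ones):

-- In the session doing the cleanup: make VACUUM pause between batches of
-- page reads and writes (milliseconds; pick a value that suits your hardware).
SET vacuum_cost_delay = 20;
VACUUM;

-- From another connection: look for sessions piling up behind it.
SELECT pid, state, waiting, query
FROM pg_stat_activity
WHERE state <> 'idle';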
More information about the replication issue is here.
So, how did this happen? Well, all three issues were unexpected side effects of fixes we applied for other problems in earlier versions. For example, the replication issue comes from two independent fixes for failover bugs interacting badly, even though each fix was necessary on its own. Since all of these issues depend on both timing and heavy traffic, and their effect is subtle (as in, a few missing or extra rows), none of our existing testing was capable of catching them.
If anyone wants to devise destruction tests to detect subtle data corruption issues in PostgreSQL before we release code -- if anyone can -- please offer to help!