The following story is only interesting to read if you like it when bad things happen, or if you don't have a trustworthy backup policy in place. By trustworthy I mean that each backup you take must be tested with a test recovery job: only tested backups will prove useful when you need them. So go read our Backup and Restore documentation chapter, then learn how to set up Barman for handling physical backups and Point In Time Recovery. Come back when you have proper backups, including recovery testing, in place. We are waiting for you. Back? OK, let's see how bad you can end up without backups, and how to still recover. With luck.
Set up a trustworthy backup solution, and review your policy

Did I mention that a trustworthy backup solution includes automated testing of your ability to recover from any and all backups you've been taking? That might be more important than you think.
This article is going to be quite long. Prepare yourself a cup of your favorite beverage.
TL;DR: PostgreSQL's resilience, flexibility and tolerance of bad situations are quite remarkable, and allowed us to get some data back from the middle of a destroyed cluster.
The Setup
Most of the customers I visit have already laid out a backup strategy and implemented it. Most of them implemented it with custom in-house scripts: they hire highly skilled engineers who have been doing system administration for more than a decade, and who are more than able to throw a shell script at the problem.
Shell scripting must in hindsight be one of the most difficult things to do right, given how often a script ends up doing something else entirely than what its author thought it would. If you want another of my rather sweeping pieces of advice, stop doing any shell scripting today: a shell is a nice interactive tool, but if you are doing non-interactive scripting, that's actually called system programming and you deserve a better tool than that.

My take: shell scripting makes it harder to write production quality code.

In our very case, the customer did realize that a
production setup had been
approved and was running live
before any backup solution was in place. Think
about it for a minute. If you don't have
tested backups in place, it's not
production ready.
Well, the incident was taken seriously, and the usual backup scripts were deployed as soon as possible. Of course, the shell scripts depended in non-obvious ways on some parameters (environment variables, database connections, a database setup with special configuration tables and rows). And being shell scripts, they implemented hardly any verification that the setup was complete, you see.
The Horror Story
And guess what is the first thing that backup script does? Of course, it makes sure enough space is available on the file system to handle the next backup. That's usually done by applying a retention policy: first remove the backups that are too old given said policy. And this script too did exactly that.
The problem is that, as some of you already guessed (yes, I see those smiles trying to hide brains thinking as fast as possible to decide if the same thing could happen to you too), the script configuration had not been done before entering production. So the script ran without its setup, and without much checking, began making bytes available. By removing any file more than 5 days old. Right. In. $PGDATA.
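We never saw it fire, but this kind of failure usually has the same shape: a cleanup loop whose target directory comes from a variable that nobody ever set. A purely hypothetical reconstruction, not the customer's actual script, might look like this; note that on many packaged installations the postgres user's home directory is the data directory itself:

# hypothetical reconstruction, NOT the actual script
# $BACKUP_DIR was supposed to come from a config file that never
# got deployed, so cd silently falls back to $HOME... which, for
# the postgres system user, is often $PGDATA itself
cd $BACKUP_DIR
find . -type f -mtime +5 -delete

A set -u at the top of the script, or a plain cd "$BACKUP_DIR" || exit 1, would have turned this disaster into a mere failed cron job.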
But recently modified files are still there, right?
Exactly: not all of the database system's files had been removed. Surely something can be done to recover data from a very small number of important tables? Let's now switch to the present tense and see about it.
Can you spell data loss?

Remember, there are no backups. The archive_command is set, though, so that's a first track to consider. After that, what we can do is try to start PostgreSQL on a copy of the remaining $PGDATA and massage it until it allows us to COPY the data out.
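Before doing anything else, we work on a copy of what remains, so that every failed experiment can start over from the same state. A minimal sketch, with hypothetical paths:

$ pg_ctl -D /var/lib/pgsql/data stop -m fast   # if anything is still running
$ rsync -a /var/lib/pgsql/data/ /copy/pgdata/  # keep a pristine copy to work from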
The desperate PITR
The WAL archive starts at the file 000000010000000000000009, which makes it unusable without a corresponding base backup, which we don't have. Well, unless maybe we tweak the system: we first need to edit the system identifier, then reset the system to begin replaying at the first file we do have. With some luck...
A broken clock is still right twice a day, a broken backup never is...

Time to try our luck here:
$ export PGDATA=/some/place
$ initdb                             # fresh cluster to tweak for the replay
$ hexedit $PGDATA/global/pg_control  # patch in the old system identifier
$ pg_controldata                     # double check the control file values
$ xlogdump /archives/000000010000000000000009   # find the xid and oid to use
$ pg_resetxlog -f -l 1,9,19 -x 2126 -o 16667 $PGDATA
$ cat > $PGDATA/recovery.conf <<EOF
restore_command = 'gunzip -c /archives/%f.gz > "%p"'
EOF
$ pg_ctl start
Using the transaction data we get from reading the first archived WAL file with xlogdump, then using pg_resetxlog (and thus accepting to maybe lose some more data), we still can't start the system in archive recovery mode, because the system identifier is not the same in the WAL files as in the system's pg_controldata output.
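The identifier on the control file side is easy to read back; a quick check might look like this (the value shown is the one from the example below):

$ pg_controldata $PGDATA | grep 'system identifier'
Database system identifier:           5923145491842547187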
So we tweak our fresh cluster to match, by changing the first 8 bytes of the control file, paying attention to the byte order. As I already had a Common Lisp REPL open on my laptop, the easiest way for me to convert the database system identifier from its decimal representation was this:

(format nil "~,,' ,2:x" 5923145491842547187)
"52 33 3D 71 52 3B 3D F3"

Paying attention to the byte order means that you need to edit the control file's first 8 bytes in reverse: F3 3D 3B 52 71 3D 33 52. But in our case, no amount of massaging seemed to allow PostgreSQL to read from the archives we had.
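For those without a Lisp REPL handy, a plain shell pipeline can produce the already-swapped byte sequence (a sketch, assuming GNU coreutils):

$ printf '%016x' 5923145491842547187 | fold -w2 | tac | xargs echo
f3 3d 3b 52 71 3d 33 52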
On to massaging what remains of the old cluster, then.
The Promotion
I don't usually do promotion in such a prominent way, but I clearly solved this situation thanks to my colleagues from the 24/7 support at 2ndQuadrant, with a special mention of Andres Freund for inspiration and tricks:
We also know how to recover your data, but we first insist on proper backups

Oh, did I mention proper backups, and how you need to have been successfully testing them before you can call a service in production or have any hope about your recovery abilities? I wasn't sure I did...
Playing fast and loose with PostgreSQL
The damaged cluster won't start, for lack of some important metadata files. The first thing missing is pg_filenode.map in the global directory. Using xlogdump, it should be possible to recover just this file if it had been changed in the WAL archives we have, but that's not the case here.
Trying to salvage a damaged case

pg_filenode.map
As this file is only used for shared relations and some bootstrapping situations (you can't read the current pg_class file node from pg_class, as the file mapping is exactly the information you need to know which file to read), and knowing that the version on disk was older than 5 days on a cluster only recently put into production, we can allow ourselves to try something: copy the pg_filenode.map from another fresh cluster.
My understanding is that this file only changes when doing heavy maintenance on system tables, like CLUSTER or VACUUM FULL, which apparently didn't get done here.
By the way, here's one of those tricks I learned in this exercise. You can read the second and fourth columns of the following output as filenames in the same directory:

# skip the 8-byte header and the trailing CRC; print 4-byte decimal integers
od -j 8 -N $((512-8-8)) -td4 < $PGDATA/global/pg_filenode.map
So copying a default pg_filenode.map over allowed us to get past that error and on to the next one.
pg_clog
Next is the lack of some pg_clog files. That's a little tricky, because those binary files contain the commit log information and are used to quickly decide if recent transactions are still in flight, committed already, or rolled back. We can easily trick the system and declare that all transactions older than 5 days (remember, the bug in the cleanup script was about that, right?) have in fact been committed. A commit in the CLOG is a 01 value, and in a single byte we can stuff as many as 4 transactions' status.
Here's how to create those files from scratch, once you've noticed that 01010101 is in fact the ASCII code for the letter U:

(code-char #b01010101)
#\U
So, to create a series of clog files where all transactions appear as committed, so that we can see the data, we can use the following command line:

for c in 0000 0001 0002 0003 0004 0005 0006 0007 0008 0009 000A 000B 000C
do
    # each pg_clog segment is 256 kB (32 pages of 8 kB) of 'U' bytes,
    # i.e. "committed" for every transaction it covers
    dd if=/dev/zero bs=256k count=1 | tr '\0' 'U' > $c
done
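A quick sanity check on one of the forged segments; the od output is fully deterministic here, since the file contains nothing but U bytes:

$ od -An -c 0000 | head -1
   U   U   U   U   U   U   U   U   U   U   U   U   U   U   U   U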
pg_database
The next problem we are confronted with is that PostgreSQL has lost the backing files for the pg_database relation and has no idea what those directories in $PGDATA/base are supposed to be all about. We only have the numbers!
That said, the customer still had a history of the commands used to install the database server, so we knew in which order the databases were created. So we had an OID to name mapping. How to apply it?
Well, pg_database is a shared catalog and the underlying file apparently isn't that easy to hack around, so the easiest solution is to actually hack the CREATE DATABASE command and have it accept a WITH OIDS option (OIDS is already a PostgreSQL keyword, OID is not, and we're not going to introduce a new keyword just for that particular patch).
Equipped with that hacked version of PostgreSQL, it's then possible to use the new command and create the databases we need with the OIDS we know.
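On the scratch cluster, the session might then look something like the following; the exact grammar is our assumption, as the patch never made it upstream, and the database name and OID here are hypothetical:

$ psql -c 'CREATE DATABASE appdb WITH OIDS = 17838;'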
Those OIDS are then to be found on disk in the file where pg_database is internally stored, and we can ask the system where that file is:
select oid, relname, pg_relation_filenode(oid)
from pg_class
where relname = 'pg_database';
oid | relname | pg_relation_filenode
------+-------------+----------------------
1262 | pg_database | 12319
(1 row)
Then without surprise we can see:
$ strings $PGDATA/global/12319
postgres
template0
template1
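Copying that file from the scratch cluster into the damaged one is then a plain file copy; a sketch, with hypothetical paths for the two data directories:

$ cp /scratch/pgdata/global/12319 /broken/pgdata/global/12319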
Once that file is copied over to the (running, as it happened) damaged
cluster, it's then possible to actually open a connection to a database. And
that's pretty impressive. But suddenly it didn't work anymore...
System Indexes
This problem was fun to diagnose. The first psql call would be fine, but the second one would always complain with an error you might never have seen in the field. I sure hadn't before.

FATAL: database "dbname" does not exist
DETAIL: Database OID 17838 now seems to belong to "otherdbname"
Part of PostgreSQL's startup is building up some caches, and for that it's using indexes. We might have made a mistake, or an index might be corrupted, but apparently there's a mismatch somewhere.
But your now all-time favourite development team knew this would happen to you, and is very careful that any feature included in the software is able to bootstrap itself without using any indexes, or that in bad situations the system knows how to resist the lack of those indexes by turning the feature off. That's the case for Event Triggers, for example, as you can see in commit cd3413ec3683918c9cb9cfb39ae5b2c32f231e8b.
Another kind of indexing system

So yes, it is indeed possible to start PostgreSQL and have that marvellous production ready system avoid all system indexes, for dealing with cases where you have reasons to think those are corrupted... or plain missing:
$ pg_ctl start -o "-P"                  # -P: ignore system indexes
$ cat >> $PGDATA/postgresql.conf <<EOF
enable_indexscan = off
enable_bitmapscan = off
enable_indexonlyscan = off
EOF
$ pg_ctl reload
While at it, we append to postgresql.conf and adjust some index usage related settings, as you can see, because this problem will certainly happen outside of the system indexes too.
If you're not using (only) PostgreSQL as your database system of choice, now is the time to check that you can actually start those other systems when their internal indexes are corrupted or missing, by the way. I think that says a lot about the readiness of a system for production usage, and the attitude of its developers towards what happens in the field.
Also note that with PostgreSQL it's then possible to rebuild those system indexes using the REINDEX command.
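Once the catalogs are healthy enough, rebuilding every system index of a database might look like this, with a hypothetical database name; note that REINDEX SYSTEM targets the database you are connected to:

$ psql -d dbname -c 'REINDEX SYSTEM dbname;'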
So we now have a running PostgreSQL service, serving the data that is still available. Well, not quite: we have a PostgreSQL service that accepts to start and allows connections to a specific database.
pg_proc, pg_operator, pg_cast, pg_aggregate, pg_amop and others
The first query I tried on the new database was against pg_class, to get details about the available tables. The psql command line tool issues a large number of queries in order to serve the \d output; the \dt one is usable in our case.
To know what queries are sent to the server by psql commands, use the \set ECHO_HIDDEN toggle.
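For example, from inside a psql session:

# \set ECHO_HIDDEN on
# \dt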
Just about any query now complains that the target database is missing files. To understand which file a message is about, I used the following query in a fresh cluster; this example is about an error message where base/16384/12062 is missing:
select oid, relname, pg_relation_filenode(oid)
from pg_class
where pg_relation_filenode(oid) = 12062;
oid | relname | pg_relation_filenode
------+---------+----------------------
1255 | pg_proc | 12062
(1 row)
In our specific case, no extensions were used. Check that before taking action here, or at least make sure that the tables you want to try to recover data from are not using extensions; that would make things so much more complex.
Here we can just use the default content for most of the system catalogs: we are using the same set of functions, operators, casts, aggregates etc. as any other 9.2 system, so we can directly use the files created by initdb and copy them over to wherever the error message leads.
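A sketch of such a copy, with hypothetical paths: base/1 is template1 in the fresh cluster (whose catalog filenode numbers are the initdb defaults), and base/16384/12062 is the path from the error message above:

$ cp /fresh/pgdata/base/1/12062 /broken/pgdata/base/16384/12062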
pg_namespace
Some error messages are about things we should definitely not ignore. The content of the pg_namespace relation was lost in about all our databases, and the applications here were using non-default schemas.
To recover from that situation, we need to better understand how this
relation is actually stored:
# select oid, * from pg_namespace;
oid | nspname | nspowner | nspacl
-------+--------------------+----------+----------------------
99 | pg_toast | 10 |
11222 | pg_temp_1 | 10 |
11223 | pg_toast_temp_1 | 10 |
11 | pg_catalog | 10 | {dim=UC/dim,=U/dim}
2200 | public | 10 | {dim=UC/dim,=UC/dim}
11755 | information_schema | 10 | {dim=UC/dim,=U/dim}
(6 rows)
# copy pg_namespace to stdout with oids;
99 pg_toast 10 \N
11222 pg_temp_1 10 \N
11223 pg_toast_temp_1 10 \N
11 pg_catalog 10 {dim=UC/dim,=U/dim}
2200 public 10 {dim=UC/dim,=UC/dim}
11755 information_schema 10 {dim=UC/dim,=U/dim}
So it's pretty easy here, actually, once you make the right connections: let's import a default pg_namespace file, then append to it thanks to COPY IN, being quite careful to use tabs as separators (well, unless you use the delimiter option of course):
# copy pg_namespace from stdin with oids;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 16443 my_namespace 10 \N
>> \.
And now there's a new schema in there with the OID we want. Wait, how do we figure out the OID we need?
# select c.oid, relname, relnamespace, nspname
from pg_class c left join pg_namespace n on n.oid = c.relnamespace
where relname = 'bar';
oid | relname | relnamespace | nspname
-------+---------+--------------+---------
16446 | bar | 16443 |
(1 row)
So in the result of that query we have no nspname, but we happen to know that the table bar is supposed to be in the schema my_namespace.
And believe it or not, that method actually allows you to create a schema in a database in a running cluster. We are directly editing the catalog files, setting even the OID of the rows we are injecting.
The reason we couldn't do that with pg_database, if you're wondering, is that pg_database is a shared catalog and part of the bootstrapping: it was impossible to start PostgreSQL at all until we fixed it, and the only implementation of COPY we have requires a running PostgreSQL instance.
pg_attribute and pg_attrdef
So now that we are able to actually refer to the right relation in a SQL command, we should be able to dump its content, right? Well, it so happens that in some cases it's OK, and in some cases it's not.
We are very lucky in this exercise in that pg_attribute is not missing. We might have been able to rebuild it thanks to a pg_upgrade implementation detail, by forcing the OID of the next table to be created and then issuing the right command, as given by pg_dump. By the way, did I mention backups? And automated recovery tests?
We need the data attributes

In some cases, though, we are missing the pg_attrdef relation wholesale. That relation is used for the default expressions attached to columns, as we can see in the following example, taken on a working database server:
# \d a
Table "public.a"
Column | Type | Modifiers
--------+---------+------------------------------------------------
id | integer | not null default nextval('a_id_seq'::regclass)
f1 | text |
Indexes:
"a_pkey" PRIMARY KEY, btree (id)
# select adrelid, adnum, adsrc
from pg_attrdef
where adrelid = 'public.a'::regclass;
adrelid | adnum | adsrc
---------+-------+-------------------------------
16411 | 1 | nextval('a_id_seq'::regclass)
(1 row)
# select attnum, atthasdef
from pg_attribute
where attrelid = 'public.a'::regclass
and atthasdef;
attnum | atthasdef
--------+-----------
1 | t
(1 row)
We need to remember that the goal here is to salvage some data out of an installation where lots is missing; it's not at all about being able to ever use that system again. Given that, what we can do here is simply ignore the default expressions of the columns, by directly updating the catalogs:
# update pg_attribute
set atthasdef = false
where attrelid = 'my_namespace.bar'::regclass;
COPY the data out! now!
At this point we are now able to actually run the COPY command and store the interesting data into a plain file, which is going to be usable on another system for analysis.
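For instance, with the hypothetical table and paths used above, from the shell:

$ psql -d dbname -c "\copy my_namespace.bar to '/safe/place/bar.copy'"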
Not every relation from the get-go, mind you; sometimes some default catalogs were still missing. But in this instance of data recovery, we were able to replace all the missing pieces of the puzzle just by copying the underlying files over, as we did in the previous sections.
Conclusion
Really, PostgreSQL once again surprises me with its flexibility and resilience. Even after something had tried quite hard to kill it dead, it was still possible to rebuild the cluster into shape piecemeal and get the interesting data back.
I should mention, maybe, that with a proper production setup, including a Continuous Archiving and Point-in-Time Recovery solution such as pgbarman, walmgr, OmniPITR or PITRtools, the recovery would have been really simple.
Using a ready-made solution is often better, because such tools don't just include backup support but also recovery support. You don't want to figure out recovery at the time you need it, and you don't want to have to discover whether your backup really is good enough to be recovered from at the time you need it either. You should be testing your backups, and the only test that counts is a recovery.
It's even one of those rare cases where using PostgreSQL replication would have been a solution: the removal of the files happened without PostgreSQL being involved, so it didn't know it was happening and wouldn't have replicated it to the standby.