I talk with a lot of folks who set their database up, start working with it, and then are surprised by issues that suddenly crop up out of nowhere. The reality is, most of us don’t want to be a DBA; we’d rather build features and just have the database work. But the truth is that a database is a living, breathing thing. As the data itself changes, the right way to query it and how it behaves change too. Keeping your database healthy and performing at its best doesn’t require a constant giant overhaul. In fact, you can view it much like personal health: regular check-ups allow you to make small but important adjustments, without dramatic life-altering changes, to keep you on the right path.
After years of running and managing literally millions of Postgres databases, here’s my breakdown of what a regular Postgres health check should look like. Consider running it on a monthly basis so you can make small tweaks and adjustments and avoid drastic changes later.
Cache rules everything around me
For many applications not all the data is accessed all the time. Instead, certain datasets are accessed heavily for some period of time, and then the data you’re accessing changes. Postgres is in fact quite good at keeping that frequently accessed data in memory.
Your cache hit ratio tells you how often your data is served from memory vs. having to go to disk. Serving from memory is orders of magnitude faster than going to disk, so the more you can keep in memory the better. Of course you could provision an instance with as much memory as you have data, but you don’t necessarily have to. Instead, watching your cache hit ratio and ensuring it stays at or above 99% is a good target for solid performance.
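To get a rough sense of how your data size compares to the memory Postgres keeps for its own buffer cache, you can compare your database size against shared_buffers. This is just a quick sketch, and keep in mind the OS page cache does plenty of caching on top of shared_buffers:

-- Rough comparison of data size vs. Postgres' own buffer cache.
-- The OS page cache adds more caching beyond shared_buffers.
SELECT pg_size_pretty(pg_database_size(current_database())) AS database_size,
       current_setting('shared_buffers') AS shared_buffers;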
You can monitor your cache hit ratio with:
SELECT
  sum(heap_blks_read) as heap_read,
  sum(heap_blks_hit)  as heap_hit,
  sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio
FROM
  pg_statio_user_tables;
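The query above covers table data; it’s also worth a quick look at how well your indexes are cached. A similar sketch against pg_statio_user_indexes, counting index blocks instead of heap blocks:

-- Index cache hit ratio, analogous to the table query above
SELECT
  sum(idx_blks_read) as idx_read,
  sum(idx_blks_hit)  as idx_hit,
  sum(idx_blks_hit) / (sum(idx_blks_hit) + sum(idx_blks_read)) as ratio
FROM
  pg_statio_user_indexes;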
Be careful of dead tuples
Under the covers, Postgres is essentially a giant append-only log. When you write data it appends to the log; when you update data it marks the old record as invalid and writes a new one; when you delete data it just marks the record invalid. Later, Postgres comes through and vacuums those dead records (also known as dead tuples).
All those unvacuumed dead tuples are what is known as bloat. Bloat can slow down other writes and create other issues. Paying attention to your bloat and when it is getting out of hand can be key for tuning vacuum on your database.
WITH constants AS (
  SELECT current_setting('block_size')::numeric AS bs, 23 AS hdr, 4 AS ma
), bloat_info AS (
  SELECT
    ma, bs, schemaname, tablename,
    (datawidth + (hdr + ma - (case when hdr%ma = 0 THEN ma ELSE hdr%ma END)))::numeric AS datahdr,
    (maxfracsum * (nullhdr + ma - (case when nullhdr%ma = 0 THEN ma ELSE nullhdr%ma END))) AS nullhdr2
  FROM (
    SELECT
      schemaname, tablename, hdr, ma, bs,
      SUM((1 - null_frac) * avg_width) AS datawidth,
      MAX(null_frac) AS maxfracsum,
      hdr + (
        SELECT 1 + count(*) / 8
        FROM pg_stats s2
        WHERE null_frac <> 0 AND s2.schemaname = s.schemaname AND s2.tablename = s.tablename
      ) AS nullhdr
    FROM pg_stats s, constants
    GROUP BY 1, 2, 3, 4, 5
  ) AS foo
), table_bloat AS (
  SELECT
    schemaname, tablename, cc.relpages, bs,
    CEIL((cc.reltuples * ((datahdr + ma -
      (CASE WHEN datahdr%ma = 0 THEN ma ELSE datahdr%ma END)) + nullhdr2 + 4)) / (bs - 20::float)) AS otta
  FROM bloat_info
  JOIN pg_class cc ON cc.relname = bloat_info.tablename
  JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname = bloat_info.schemaname AND nn.nspname <> 'information_schema'
), index_bloat AS (
  SELECT
    schemaname, tablename, bs,
    COALESCE(c2.relname, '?') AS iname, COALESCE(c2.reltuples, 0) AS ituples, COALESCE(c2.relpages, 0) AS ipages,
    COALESCE(CEIL((c2.reltuples * (datahdr - 12)) / (bs - 20::float)), 0) AS iotta -- very rough approximation, assumes all cols
  FROM bloat_info
  JOIN pg_class cc ON cc.relname = bloat_info.tablename
  JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname = bloat_info.schemaname AND nn.nspname <> 'information_schema'
  JOIN pg_index i ON indrelid = cc.oid
  JOIN pg_class c2 ON c2.oid = i.indexrelid
)
SELECT
  type, schemaname, object_name, bloat, pg_size_pretty(raw_waste) as waste
FROM
(SELECT
  'table' as type,
  schemaname,
  tablename as object_name,
  ROUND(CASE WHEN otta = 0 THEN 0.0 ELSE table_bloat.relpages / otta::numeric END, 1) AS bloat,
  CASE WHEN relpages < otta THEN '0' ELSE (bs * (table_bloat.relpages - otta)::bigint)::bigint END AS raw_waste
FROM
  table_bloat
UNION
SELECT
  'index' as type,
  schemaname,
  tablename || '::' || iname as object_name,
  ROUND(CASE WHEN iotta = 0 OR ipages = 0 THEN 0.0 ELSE ipages / iotta::numeric END, 1) AS bloat,
  CASE WHEN ipages < iotta THEN '0' ELSE (bs * (ipages - iotta))::bigint END AS raw_waste
FROM
  index_bloat) bloat_summary
ORDER BY raw_waste DESC, bloat DESC;
Query courtesy of Heroku’s pg-extras
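If bloat does look like it’s getting out of hand, a lighter-weight first step is to see how many dead tuples each table is carrying and when autovacuum last touched it. Here’s a quick sketch using pg_stat_user_tables, followed by an example per-table autovacuum tweak (the table name events is hypothetical, and 0.05 is just one reasonable starting point to make autovacuum trigger sooner):

-- Dead vs. live tuples per table, plus the last manual and auto vacuum runs
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

-- Example: make autovacuum kick in sooner on a hypothetical busy table
ALTER TABLE events SET (autovacuum_vacuum_scale_factor = 0.05);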
Over-optimizing is a thing
We always want our database to be performant, so to get there we keep things in memory/cache (see earlier) and we index things so we don’t have to scan everything on disk. But there is a trade-off when it comes to indexing your database. Each index the system has to maintain slows down your write throughput. That’s a fine price to pay for speeding up queries, as long as the indexes are actually being utilized. If you added an index years ago, but something within your application changed and you no longer need it, it’s best to remove it.
Postgres makes it simple to query for unused indexes, so you can easily give yourself back some performance by removing them:
SELECT
  schemaname || '.' || relname AS table,
  indexrelname AS index,
  pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size,
  idx_scan as index_scans
FROM pg_stat_user_indexes ui
JOIN pg_index i ON ui.indexrelid = i.indexrelid
WHERE NOT indisunique
  AND idx_scan < 50
  AND pg_relation_size(relid) > 5 * 8192
ORDER BY
  pg_relation_size(i.indexrelid) / nullif(idx_scan, 0) DESC NULLS FIRST,
  pg_relation_size(i.indexrelid) DESC;
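Once you’ve confirmed an index really isn’t needed (keep in mind the usage counters above only reflect the server you run them on, so an index that looks unused on the primary might still be serving a read replica), dropping it is straightforward. A sketch with a hypothetical index name:

-- CONCURRENTLY avoids blocking writes to the table while the index is dropped
DROP INDEX CONCURRENTLY IF EXISTS index_users_on_legacy_flag;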
Check in on your query performance
In an earlier blog post we talked about how useful pg_stat_statements is for monitoring your database query performance. It records a lot of valuable stats about which queries are run, how fast they return, how many times they’re run, etc. Checking in on this set of queries regularly can tell you where it’s best to add indexes or optimize your application so your query calls are not so excessive.
Thanks to an HN commenter on our earlier post, we have a great query that is easy to tweak to show different views of all that data:
SELECT query,
       calls,
       total_time,
       total_time / calls as time_per,
       stddev_time,
       rows,
       rows / calls as rows_per,
       100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
WHERE query not similar to '%pg_%'
AND calls > 500
--ORDER BY calls
--ORDER BY total_time
ORDER BY time_per
--ORDER BY rows_per
DESC
LIMIT 20;
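If pg_stat_statements isn’t already enabled on your database, it ships with Postgres as a contrib extension. Here’s a minimal setup sketch, assuming you have superuser access and can restart the server (ALTER SYSTEM shown here will replace any existing shared_preload_libraries value, so adjust accordingly):

-- Load the module at server start (requires a restart to take effect)
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';

-- After the restart, create the extension in the database you want to track
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Later, you can clear the collected stats to start a fresh measurement window
SELECT pg_stat_statements_reset();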
An apple a day…
A regular health check for your database saves you a lot of time in the long run. It allows you to gradually maintain and improve things without huge re-writes. I’d personally recommend a monthly or bimonthly check-in on all of the above to ensure things are in a good state.