Planet PostgreSQL

Craig Kerstiens: Configuring memory for Postgres


work_mem is perhaps the most confusing setting within Postgres. work_mem is a configuration within Postgres that determines how much memory can be used during certain operations. At its surface, the work_mem setting seems simple: after all, work_mem just specifies the amount of memory available to be used by internal sort operations and hash tables before writing data to disk. And yet, leaving work_mem unconfigured can bring on a host of issues. What perhaps is more troubling, though, is when you receive an out of memory error on your database and you jump in to tune work_mem, only for it to behave in an un-intuitive manner.

Setting your default memory

The work_mem value defaults to 4MB in Postgres, and that’s likely a bit low. This means that each Postgres operation (each join, some sorts, etc.) can consume 4MB before it starts spilling to disk. When Postgres starts writing temp files to disk, things will obviously be much slower than operating in memory. You can find out if you’re spilling to disk by searching for "temporary file" within your PostgreSQL logs when you have log_temp_files enabled. If you see "temporary file", it can be worth increasing your work_mem.
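
If you want to see this in action, here is a minimal sketch (the table and column names are hypothetical, and setting log_temp_files to 0 simply logs every temporary file created):

ALTER SYSTEM SET log_temp_files = 0;
SELECT pg_reload_conf();

-- EXPLAIN ANALYZE also reports spills directly, e.g. "Sort Method: external merge  Disk: ..."
EXPLAIN ANALYZE SELECT * FROM page_hits ORDER BY visited_at;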

On Citus Cloud (our fully-managed database as a service that scales out Postgres horizontally), we automatically tune work_mem based on the overall memory available to the box. Our tuning is based on the years of experience of what we’ve seen work for a variety of production Postgres workloads, coupled with statistics to compute variations based on cluster sizing.

It’s tough to get the value of work_mem exactly right, but if you’re looking for a one-size-fits-all answer, a sane default is often something like 64 MB.

It’s not just about the memory for queries

Let’s use an example to explore how to think about optimizing your work_mem setting.

Say you have 10 GB of memory available. If you have 100 running Postgres queries, and each of those queries has a 10 MB connection overhead, then 100 * 10 MB (1 GB) of memory is taken up by the 100 connections—which leaves you with 9 GB of memory.

With 9 GB of memory remaining, say you give 90 MB to work_mem for the 100 running queries. But wait, it’s not that simple. Why? Well, work_mem isn’t set on a per-query basis; rather, it’s set based on the number of sort/hash operations. But how many sorts/hashes and joins happen per query? Now that is a complicated question, made more complicated if you have other processes that also consume memory, such as autovacuum.

So let’s reserve a little memory for maintenance tasks and for vacuum, and then we’ll be okay as long as we limit our connections, right? Not so fast, my friend.

Postgres now has parallel queries. If you’re using Citus for parallelism you’ve had this for a while, but now you have it on single-node Postgres as well. What this means is that a single query can have multiple processes running and performing work. This can result in some significant improvements in query speed, but each of those running processes can consume the specified amount of work_mem. With our 64 MB default and 100 connections, we could now have each query running a process per core, consuming far more memory than we anticipated.

More work_mem, more problems

So we can see that getting it perfect is a little more work than ideal. Let’s go back a little and try this more simply… we can start work_mem small, at say 16 MB, and gradually increase it when we see "temporary file" in the logs. But why not give each query as much memory as it would like? If we were to just say each process could consume up to 1 GB of memory, what’s the harm? Well, the other extreme is that queries begin consuming more memory than you have available on your box. If you have 100 queries, each with 5 different sort operations and a few hash joins, it’s in fact very possible to exhaust all the memory available to your database.

When you consume more memory than is available on your machine you can start to see out of memory errors within your Postgres logs, or in worse cases the OOM killer can start to randomly kill running processes to free up memory. An out of memory error in Postgres simply errors on the query you’re running, whereas the OOM killer in Linux begins killing running processes, which in some cases might even include Postgres itself.

When you see an out of memory error you either want to increase the overall RAM on the machine itself by upgrading to a larger instance, OR you want to decrease the amount of memory that work_mem uses. Yes, you read that right: when you run out of memory it’s often better to decrease work_mem rather than increase it, since work_mem is the amount of memory each operation can consume, and too many operations are using up to that much memory.

General guidance for work_mem

While you can continually tune and tweak work_mem, a couple of broad guidelines for matching it to your workload can generally get you into a good spot:

If you have a number of short-running queries that run very frequently and perform simple lookups and joins, then maintaining a lower work_mem is ideal; in this case you get diminishing returns from setting it significantly higher because the extra memory is simply unused. If your workload consists of relatively few active queries at a time that do very complex sorts and joins, then granting more memory to prevent things from spilling to disk can give you great returns.
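
For the second kind of workload, one option is to raise work_mem only for the session or transaction that runs the heavy query rather than globally. A minimal sketch (the 256MB value and the reporting query are illustrative):

BEGIN;
SET LOCAL work_mem = '256MB';  -- applies only to this transaction
SELECT customer_id, sum(amount)
FROM orders                    -- hypothetical table
GROUP BY customer_id
ORDER BY sum(amount) DESC;
COMMIT;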

Happy database tuning

Postgres’ powerful feature set and flexibility mean you have a lot of knobs you can turn and levers you can pull when tuning it. Postgres is used for embedded systems, for time series data, and for OLTP and OLAP alike. This flexibility can often mean an overwhelming set of options when tuning. On Citus Cloud we’ve configured this to be suitable for most workloads we see; think of it as one size fits most, and then when you need to, you’re able to customize. If you’re not running on Citus Cloud, consider leveraging pgtune to help you get to a good starting point.


Pavel Stehule: Article about PostgreSQL 11

My new article is in Czech, but Google Translate can help.

Andrew Dunstan: Road test your patch in one command


If you have Docker installed on your development machine, there is a simple way to road test your code using the buildfarm client, in a nicely contained environment.

These preparatory steps only need to be done once. First clone the repository that has the required container definitions:

git clone https://github.com/PGBuildFarm/Dockerfiles.git bf-docker

Then build a container image to run the command (in this example we use the file based on Fedora 28):

cd bf-docker
docker build --rm=true -t bf-f28 -f Dockerfile.fedora-28 .

Make a directory to contain all the build artefacts:

mkdir buildroot-f28

That’s all the preparation required. Now you can road test your code with this command:

docker run -v buildroot-f28:/app/buildroot \
  -v /path/to/postgres/source:/app/pgsrc bf-f28 \ 
  run_build.pl --config=build-fromsource.conf

The config file can be customized if required, but this is a pretty simple way to get started.

Haroon .: Using Window Functions for Time Series IoT Analytics in Postgres-BDR


The Internet of Things tends to generate large volumes of data at great velocity. Oftentimes this data is collected from geographically distributed sources and aggregated at a central location for data scientists to perform their magic, i.e. find patterns, spot trends and make predictions.

Let’s explore what the IoT solution using Postgres-BDR has to offer for data analytics. Postgres-BDR is offered as an extension on top of PostgreSQL 10 and above. It is not a fork. Therefore, we get the power of the complete set of analytic functions that PostgreSQL has to offer. In particular, I am going to play around with PostgreSQL’s Window Functions here to analyze a sample of time series data from temperature sensors.

Let’s take an example of IoT temperature sensor time series data spread over a period of 7 days. In a typical scenario, temperature sensors are sending readings every minute. Some cases could be even more frequent. For the sake of simplicity, however, I am going to use one reading per day. The objective is to use PostgreSQL’s Window Functions for running analytics and that would not change with increasing the frequency of data points.

Here’s our sample data. Again, the number of table fields in a real-world IoT temperature sensor would be higher, but our fields of interest in this case are restricted to the timestamp of the temperature recording, the device that reported it, and the actual reading.

CREATE TABLE iot_temperature_sensor_data (
    ts timestamp without time zone,
    device_id text,
    reading float
);

and we add some random time series temperature sensor readings for seven consecutive days:

         ts          |            device_id             | reading 
---------------------+----------------------------------+---------
 2017-06-01 00:00:00 | ff0d1c4fd33f8429b7e3f163753a9cb0 |      10
 2017-06-02 00:00:00 | d125efa43a62af9f50c1a1edb733424d |       9
 2017-06-03 00:00:00 | 0b4ee949cc1ae588dd092a23f794177c |       1
 2017-06-04 00:00:00 | 8b9cef086e02930a808ee97b87b07b03 |       3
 2017-06-05 00:00:00 | d94d599467d1d8d3d347d9e9df809691 |       6
 2017-06-06 00:00:00 | a9c1e8f60f28935f3510d7d83ba54329 |       9
 2017-06-07 00:00:00 | 6fb333bd151b4bcc684d21b248c06ca3 |       9

Some of the very obvious questions of course:

  • What was the lowest temperature reading and when?
  • What was the highest temperature reading and when?

Quite understandably, we can run MIN() and MAX() on the ‘reading’ column to get the lowest and highest temperature readings:

SELECT MIN(reading), MAX(reading) FROM
    iot_temperature_sensor_data;
 min | max 
-----+-----
   1 |  10
(1 row)

While it is useful, it doesn’t tell us which corresponding dates and/or devices reported lowest and highest temperatures.

SELECT
    ts,
    device_id,
    reading
FROM
    iot_temperature_sensor_data
WHERE
    reading = (SELECT MIN(reading) FROM
            iot_temperature_sensor_data);
         ts          |            device_id             | reading 
---------------------+----------------------------------+---------
 2017-06-03 00:00:00 | 0b4ee949cc1ae588dd092a23f794177c |       1
(1 row)

So far so good!

However, what about questions like:

  • Where does a given day rank in terms of temperature reading for the given week, in a specified order?
  • How low or high was the temperature compared to the previous or following day? Essentially, we are looking for the delta in temperature readings for a given day compared to the previous and/or following day.

This is where Window Functions come into play, allowing us to compare and contrast values in relation to the current row.

So if we wanted to find where a given day ranks in terms of its temperature reading with lowest temperature ranked at the top:

SELECT
    ts,
    device_id,
    reading,
    rank() OVER (ORDER BY reading)
FROM
    iot_temperature_sensor_data;
         ts          |            device_id             | reading | rank
---------------------+----------------------------------+---------+------
 2017-06-03 00:00:00 | 0b4ee949cc1ae588dd092a23f794177c |       1 |    1
 2017-06-04 00:00:00 | 8b9cef086e02930a808ee97b87b07b03 |       3 |    2
 2017-06-05 00:00:00 | d94d599467d1d8d3d347d9e9df809691 |       6 |    3
 2017-06-02 00:00:00 | d125efa43a62af9f50c1a1edb733424d |       9 |    4
 2017-06-06 00:00:00 | a9c1e8f60f28935f3510d7d83ba54329 |       9 |    4
 2017-06-07 00:00:00 | 6fb333bd151b4bcc684d21b248c06ca3 |       9 |    4
 2017-06-01 00:00:00 | ff0d1c4fd33f8429b7e3f163753a9cb0 |      10 |    7
(7 rows) 

If we look at the rank column, it looks good except that we notice rank 7 right after 4. This happens because ranks get skipped when we have three rows with an identical temperature reading of 9. What if we don’t want ranks to be skipped? PostgreSQL provides dense_rank() exactly for this purpose.

SELECT
    ts,
    device_id,
    reading,
    dense_rank() OVER (ORDER BY reading)
FROM
    iot_temperature_sensor_data;
         ts          |            device_id             | reading | dense_rank
---------------------+----------------------------------+---------+------------
 2017-06-03 00:00:00 | 0b4ee949cc1ae588dd092a23f794177c |       1 |          1
 2017-06-04 00:00:00 | 8b9cef086e02930a808ee97b87b07b03 |       3 |          2
 2017-06-05 00:00:00 | d94d599467d1d8d3d347d9e9df809691 |       6 |          3
 2017-06-02 00:00:00 | d125efa43a62af9f50c1a1edb733424d |       9 |          4
 2017-06-06 00:00:00 | a9c1e8f60f28935f3510d7d83ba54329 |       9 |          4
 2017-06-07 00:00:00 | 6fb333bd151b4bcc684d21b248c06ca3 |       9 |          4
 2017-06-01 00:00:00 | ff0d1c4fd33f8429b7e3f163753a9cb0 |      10 |          5
(7 rows)

How about which days saw the maximum rise in temperature compared to the previous day?

SELECT
    ts,
    device_id,
    reading,
    reading - lag(reading, 1) OVER (ORDER BY reading) AS diff
FROM
    iot_temperature_sensor_data;

         ts          |            device_id             | reading | diff
---------------------+----------------------------------+---------+------
 2017-06-03 00:00:00 | 0b4ee949cc1ae588dd092a23f794177c |       1 |
 2017-06-04 00:00:00 | 8b9cef086e02930a808ee97b87b07b03 |       3 |    2
 2017-06-05 00:00:00 | d94d599467d1d8d3d347d9e9df809691 |       6 |    3
 2017-06-02 00:00:00 | d125efa43a62af9f50c1a1edb733424d |       9 |    3
 2017-06-06 00:00:00 | a9c1e8f60f28935f3510d7d83ba54329 |       9 |    0
 2017-06-07 00:00:00 | 6fb333bd151b4bcc684d21b248c06ca3 |       9 |    0
 2017-06-01 00:00:00 | ff0d1c4fd33f8429b7e3f163753a9cb0 |      10 |    1
(7 rows)

We see that the last column shows the rise in temperature vs. the previous day. A quick visual inspection shows that the maximum rise in temperature is 3 degrees, on 2017-06-02 00:00:00 and then again on 2017-06-05 00:00:00. With a little bit of CTE magic, we can list the days that saw the maximum rise in temperature.

WITH temperature_data AS (
    SELECT
        ts,
        device_id,
        reading,
        reading - lag(reading, 1) OVER (ORDER BY reading) AS diff
    FROM
        iot_temperature_sensor_data
)
SELECT
    ts,
    diff
FROM
    temperature_data
WHERE
    diff = (SELECT MAX(diff) FROM
            temperature_data);
         ts          | diff 
---------------------+------
 2017-06-05 00:00:00 |    3
 2017-06-02 00:00:00 |    3

Besides rank(), dense_rank() and lag() that we used above, here are more of the Window Functions that PostgreSQL and Postgres-BDR support:

  • row_number()
  • percent_rank()
  • cume_dist()
  • ntile(num_buckets integer)
  • lead(value anyelement [, offset integer [, default anyelement ]])
  • first_value(value any)
  • last_value(value any)
  • nth_value(value any, nth integer)
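
As a quick illustration of two functions from this list (a sketch against the same sample table; output omitted), lead() peeks at the following row, here the next day’s reading in timestamp order, while ntile() splits the week into two buckets from coolest to warmest:

SELECT
    ts,
    reading,
    lead(reading, 1) OVER (ORDER BY ts) AS next_day_reading,
    ntile(2) OVER (ORDER BY reading) AS temperature_half
FROM
    iot_temperature_sensor_data;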

For further reading, please refer to the PostgreSQL documentation on Window Functions.

gabrielle roth: PDXPUG: June meeting


When: 6-8pm Thursday June 21, 2018
Where: iovation
Who: Mark Wong
What: Intro to OmniDB with PostgreSQL

OmniDB is an open source browser-based app designed to access and manage many different Database Management systems, e.g. PostgreSQL, Oracle and MySQL. OmniDB can run either as an App or via Browser, combining the flexibility needed for various access paths with a design that puts security first.

OmniDB’s main objective is to offer a unified workspace with all the functionality needed to manage different DBMSs. It is built with simplicity in mind, designed to be a fast and lightweight browser-based application.

Get a tour of OmniDB with PostgreSQL!

 

Mark leads the 2ndQuadrant performance practice as a Performance Consultant for English Speaking Territories, based out of Oregon in the USA. He is a long time Contributor to PostgreSQL, co-organizer of the Portland PostgreSQL User Group, and serves as a Director and Treasurer for the United States PostgreSQL Association.


If you have a job posting or event you would like me to announce at the meeting, please send it along. The deadline for inclusion is 5pm the day before the meeting.

Our meeting will be held at iovation, on the 3rd floor of the US Bancorp Tower at 111 SW 5th (5th & Oak). It’s right on the Green & Yellow Max lines. Underground bike parking is available in the parking garage; outdoors all around the block in the usual spots. No bikes in the office, sorry! For access to the 3rd floor of the plaza, please either take the lobby stairs to the third floor or take the plaza elevator (near Subway and Rabbit’s Cafe) to the third floor. There will be signs directing you to the meeting room. All attendees must check in at the iovation front desk.

See you there!

Marco Slot: Scalable incremental data aggregation on Postgres and Citus


Many companies generate large volumes of time series data from events happening in their application. It’s often useful to have a real-time analytics dashboard to spot trends and changes as they happen. You can build a real-time analytics dashboard on Postgres by constructing a simple pipeline:

  1. Load events into a raw data table in batches
  2. Periodically aggregate new events into a rollup table
  3. Select from the rollup table in the dashboard

For large data streams, Citus (an open source extension to Postgres that scales out Postgres horizontally) can scale out each of these steps across all the cores in a cluster of Postgres nodes.

One of the challenges of maintaining a rollup table is tracking which events have already been aggregated—so you can make sure that each event is aggregated exactly once. A common technique to ensure exactly-once aggregation is to run the aggregation for a particular time period after that time period is over. We often recommend aggregating at the end of the time period for its simplicity, but you cannot provide any results before the time period is over and backfilling is complicated.

Building rollup tables in a new and different way

In this blog post, we’ll introduce a new approach to building rollup tables which addresses the limitations of using time windows. When you load older data into the events table, the rollup tables will automatically be updated, which enables backfilling and late arrivals. You can also start aggregating events from the current time period before the current time period is over, giving a more real time view of the data.

We assume all events have an identifier which is drawn from a sequence and provide a simple SQL function that enables you to incrementally aggregate ranges of sequence values in a safe, transactional manner.

We tested this approach for a CDN use case and found that a 4-node Citus database cluster can simultaneously:

  • Ingest and aggregate over a million rows per second
  • Keep the rollup table up-to-date within ~10s
  • Answer analytical queries in under 10ms.

SQL infrastructure for incremental aggregation

To do incremental aggregation, we need a way to distinguish which events have been aggregated. We assume each event has an identifier that is drawn from a Postgres sequence, and events have been aggregated up to a certain sequence number. To track this, we can create a rollups table that also contains the name of the events table and the sequence name.

CREATE TABLE rollups (
    name text primary key,
    event_table_name text not null,
    event_id_sequence_name text not null,
    last_aggregated_id bigint default 0
);

As of Postgres 10, we can use the pg_sequence_last_value function to check the most recently issued sequence number. However, it would not be safe to simply aggregate all events up to the most recent sequence value. There might still be in-progress writes to the events table that were assigned lower sequence values, but are not yet visible when the aggregation runs. To wait for in-progress writes to finish, we use an explicit table lock as discussed in our recent Postgres locking tips blog post. New writes will briefly block from the moment the LOCK command is executed. Once existing writes are finished, we have the lock, and then we immediately release it to allow new writes to continue. We can do that because we know that all new writes will have higher sequence number, and we can allow those writes to continue as long as we don’t include them in the current aggregation. As a result, we can obtain a range of new events that are ready to be aggregated with minimal interruption on the write side.

We codified this logic in a PL/pgSQL function which returns the range of sequence numbers that are ready to be aggregated. This PL/pgSQL function can be used in a transaction with an INSERT..SELECT, where the SELECT part filters out sequence numbers that fall outside the range.

CREATE OR REPLACE FUNCTION incremental_rollup_window(rollup_name text, OUT window_start bigint, OUT window_end bigint)
RETURNS record
LANGUAGE plpgsql
AS $function$
DECLARE
    table_to_lock regclass;
BEGIN
    /*
     * Perform aggregation from the last aggregated ID + 1 up to the last committed ID.
     * We do a SELECT .. FOR UPDATE on the row in the rollup table to prevent
     * aggregations from running concurrently.
     */
    SELECT event_table_name, last_aggregated_id+1, pg_sequence_last_value(event_id_sequence_name)
    INTO table_to_lock, window_start, window_end
    FROM rollups
    WHERE name = rollup_name FOR UPDATE;

    IF NOT FOUND THEN
        RAISE 'rollup ''%'' is not in the rollups table', rollup_name;
    END IF;

    IF window_end IS NULL THEN
        /* sequence was never used */
        window_end := 0;
        RETURN;
    END IF;

    /*
     * Play a little trick: We very briefly lock the table for writes in order to
     * wait for all pending writes to finish. That way, we are sure that there are
     * no more uncommitted writes with a identifier lower or equal to window_end.
     * By throwing an exception, we release the lock immediately after obtaining it
     * such that writes can resume.
     */
    BEGIN
        EXECUTE format('LOCK %s IN EXCLUSIVE MODE', table_to_lock);
        RAISE 'release table lock';
    EXCEPTION WHEN OTHERS THEN
    END;

    /*
     * Remember the end of the window to continue from there next time.
     */
    UPDATE rollups SET last_aggregated_id = window_end WHERE name = rollup_name;
END;
$function$;

Now let’s look at an example of using this function for a typical rollup use case.

Incremental aggregation of page view data

A simple example of a real-time analytics dashboard I like to use is for monitoring page views on a website. In a past life I worked for a Content Delivery Network (CDN), where such a dashboard is essential both to operators and customers.

Back when I worked for the CDN, it would have been almost unthinkable to store full page view logs in a SQL database like Postgres. But with the Citus extension to Postgres, you can now scale Postgres as well as any distributed storage system, while supporting distributed queries, indexes, and rollups.

You can now simply store raw events directly in a table and process them later. To deal with large write volumes, it is helpful to minimise index maintenance overhead. We recommend using a BRIN index for looking up ranges of sequence IDs during aggregation. A BRIN index takes very little storage space and is cheap to maintain in this case.

CREATE TABLE page_views (
    site_id int,
    path text,
    client_ip inet,
    view_time timestamptz default now(),
    view_id bigserial
);

-- Allow fast lookups of ranges of sequence IDs
CREATE INDEX view_id_idx ON page_views USING BRIN (view_id);

-- Citus only: distribute the table by site ID
SELECT create_distributed_table('page_views', 'site_id');

The rollup table will keep track of the number of views per page for a particular minute. Once populated, you can run SQL queries on the table to aggregate further.

CREATE TABLE page_views_1min (
    site_id int,
    path text,
    period_start timestamptz,
    view_count bigint,
    primary key (site_id, path, period_start)
);

-- Citus only: distribute the table by site ID
SELECT create_distributed_table('page_views_1min', 'site_id');

-- Add the 1-minute rollup to the rollups table
INSERT INTO rollups (name, event_table_name, event_id_sequence_name)
VALUES ('page_views_1min_rollup', 'page_views', 'page_views_view_id_seq');

Now you can define a function for incrementally aggregating the page views using INSERT..SELECT..ON CONFLICT.. and the incremental_rollup_window to select a batch of new, unaggregated events:

CREATE OR REPLACE FUNCTION do_page_view_aggregation(OUT start_id bigint, OUT end_id bigint)
RETURNS record
LANGUAGE plpgsql
AS $function$
BEGIN
    /* determine which page views we can safely aggregate */
    SELECT window_start, window_end INTO start_id, end_id
    FROM incremental_rollup_window('page_views_1min_rollup');

    /* exit early if there are no new page views to aggregate */
    IF start_id > end_id THEN RETURN; END IF;

    /* aggregate the page views */
    INSERT INTO page_views_1min (site_id, path, period_start, view_count)
    SELECT site_id, path, date_trunc('minute', view_time), count(*) AS view_count
    FROM page_views
    WHERE view_id BETWEEN start_id AND end_id
    GROUP BY site_id, path, date_trunc('minute', view_time)
    ON CONFLICT (site_id, path, period_start)
    DO UPDATE SET view_count = page_views_1min.view_count + EXCLUDED.view_count;
END;
$function$;

After inserting into the page_views table, the aggregation can be updated by periodically running:

SELECT * FROM do_page_view_aggregation();

By running the do_page_view_aggregation function frequently, you can keep the rollup table up-to-date within seconds of the raw events table. You can also safely load older page view data with a view_time in the past, because these records will still have higher sequence numbers.
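
One way to run it frequently from inside the database is a job scheduler; this is a sketch assuming the pg_cron extension is available, and any external scheduler (cron, systemd timers, your application) works just as well:

CREATE EXTENSION IF NOT EXISTS pg_cron;

-- run the aggregation once per minute
SELECT cron.schedule('* * * * *', 'SELECT * FROM do_page_view_aggregation()');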

If you want to keep track of more advanced statistics in your rollup table, we recommend you check out how to incrementally update count distinct using HLL and heavy hitters using TopN.

Querying the rollup table from the dashboard

If you’re building a dashboard that provides real-time insights into page views, you could actually run queries directly on the raw event data. Citus will parallelise the query at different levels to achieve high performance.

The following query gets the number of page views for an entire site per minute for the last 30 minutes, which takes between 800-900ms on Citus with 1 billion rows loaded.

SELECT date_trunc('minute', view_time) period_start, count(*)
FROM page_views
WHERE site_id = 2 AND view_time >= '2018-06-07 08:54:00'
GROUP BY period_start
ORDER BY period_start;

      period_start      |  sum
------------------------+-------
 2018-06-07 08:54:00+00 | 36482
 2018-06-07 08:55:00+00 | 51272
 2018-06-07 08:56:00+00 | 55216
 2018-06-07 08:57:00+00 | 74936
 2018-06-07 08:58:00+00 | 15776
(30 rows)

Time: 869.478 ms

This may actually be fast enough to power a single-user dashboard, but if there are multiple users then the query uses way too much raw CPU time. Fortunately, the equivalent query on the rollup table is more than 100x faster because the table is smaller and indexed:

citus=# SELECT period_start, sum(view_count)
FROM page_views_1min
WHERE site_id = 2 AND period_start >= '2018-06-07 08:54:00'
GROUP BY period_start
ORDER BY period_start;

      period_start      |  sum
------------------------+-------
 2018-06-07 08:54:00+00 | 36482
 2018-06-07 08:55:00+00 | 51272
 2018-06-07 08:56:00+00 | 55216
 2018-06-07 08:57:00+00 | 74936
 2018-06-07 08:58:00+00 | 15776
...
(30 rows)

Time: 5.473 ms

It’s clear that when we want to support a larger number of users, rollup tables are the way to go.

Aggregation pipeline performance in Citus and Postgres

To test the performance of the data pipeline we randomly generate page view data for 1,000 sites and 100,000 pages. We compared the performance of a distributed Citus Cloud formation (4*r4.4xlarge) against a single Postgres node with equivalent hardware (r4.16xlarge RDS).

We also tried using Aurora, but since it runs an older version of Postgres the pg_sequence_last_value function was unavailable.

We loaded 1 billion rows into the page_views table using the COPY command over 4 connections in batches of 1 million rows. Below are the average data loading speeds.

To actually be able to process 1 million rows per second, it’s important for the aggregation process to keep up with the stream of data, while it is being loaded.
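
The load itself was plain COPY; as a sketch, a single batch loaded through psql's \copy might look like this (the file name is hypothetical, and view_id is filled in by its sequence):

\copy page_views (site_id, path, client_ip, view_time) FROM 'page_views_batch_0001.csv' WITH (FORMAT csv)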

During the data load, we ran SELECT * FROM do_page_view_aggregation() in a loop to update the page_views_1min table. Citus parallelises the aggregation across all the cores in the cluster and is easily able to keep up with the COPY stream. In contrast, single-node Postgres does not parallelise INSERT…SELECT commands and could not keep up with its own ingest speed.

On Citus, every individual run took around 10 seconds, which means that the page_views_1min table was never more than 10 seconds behind the page_views table. With a single Postgres node, on the other hand, the aggregation could not keep up, so it started taking arbitrarily long (>10 minutes).

A database for real-time analytics at scale

Being able to express your aggregations in SQL and using indexes that keep your aggregations performant are both invaluable in doing real-time analytics on large data streams. While a single Postgres server cannot always keep up with large data streams, Citus transforms Postgres into a distributed database and enables you to scale out across multiple cores, and process over a million rows per second.

Tatsuo Ishii: Even more fine load balancing control with Pgpool-II 4.0

A while back, I posted an article titled "More load balancing fine control" to explain a new feature of the upcoming Pgpool-II 4.0. Today I would like to talk about yet another new load balancing feature in Pgpool-II.

Pgpool-II is already able to control read query load balancing at several levels of granularity
(see the documentation for more details):
  • Whether to enable load balancing or not (load_balance_mode)
  • By database (database_redirect_preference_list)
  • By application name (app_name_redirect_preference_list)
  • By functions used in the query (white_function_list, black_function_list)
  • Statement level (/*NO LOAD BALANCE*/ comment)
The last method allows statement-level control of load balancing, but it requires rewriting the query, which is often impossible if you are using commercial software.

A new configuration parameter called "black_query_pattern_list" allows you to disable load balancing for queries specified by the parameter. The parameter takes a list of regular expression strings. If a query matches an expression, load balancing is disabled for that query: the query is sent to the primary (master) PostgreSQL server.

Here are some examples.

We have:
black_query_pattern_list = 'SELECT \* FROM t1\;;'
in pgpool.conf. Note that some special characters must be escaped using a '\' (backslash) character.

Initially, no SELECTs have been issued to either the primary or the standby.

test=# show pool_nodes;
 node_id | hostname | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change 
---------+----------+-------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | /tmp     | 11002 | up     | 0.500000  | primary | 0          | false             | 0                 | 2018-06-15 11:57:25
 1       | /tmp     | 11003 | up     | 0.500000  | standby | 0          | true              | 0                 | 2018-06-15 11:57:25
(2 rows)


If the following query is issued, then the "select_cnt" column of the standby should be incremented, since the standby is the load balance node (notice the "role" column).

test=# SELECT * FROM t1;

But because of the effect of black_query_pattern_list, the SELECT is redirected to the primary.

test=# show pool_nodes;
 node_id | hostname | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change 
---------+----------+-------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | /tmp     | 11002 | up     | 0.500000  | primary | 1          | false             | 0                 | 2018-06-15 11:57:25
 1       | /tmp     | 11003 | up     | 0.500000  | standby | 0          | true              | 0                 | 2018-06-15 11:57:25
(2 rows)



However, "SELECT * FROM t1 WHERE i = 0" will be sent to standby since the expression specied in black_query_pattern_list does not match with the query.

test=# select * from t1 where i = 1;
 i
---
(0 rows)


test=# show pool_nodes;
 node_id | hostname | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change 
---------+----------+-------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | /tmp     | 11002 | up     | 0.500000  | primary | 1          | false             | 0                 | 2018-06-15 11:57:25
 1       | /tmp     | 11003 | up     | 0.500000  | standby | 1          | true              | 0                 | 2018-06-15 11:57:25
(2 rows)


If you want to match any query using table t2, you can specify something like:

  .* t2;;.* t2 .*;

The first part ".* t2;" matches any query without any qualification (example: "SELECT * FROM t2"), while latter matches any query having space after t2 (example: "SELECT * FROM t2 WHERE i = 1).

Please note that Pgpool-II does not do any syntax analysis on the query here, so please be careful when specifying a regular expression. For instance, "SELECT * FROM t1 WHERE EXISTS (SELECT * FROM t2)" will not match any of the regular expression patterns above, so the query will be sent to the standby. This may or may not match your expectations.

By the way, you may have already noticed the extra column "last_status_change" of the "show pool_nodes" command. This indicates the timestamp when either "status" or "role" last changed. This is useful when you want to find the log lines from the last time a failover happened.

Hans-Juergen Schoenig: PostgreSQL clusters: Monitoring performance


When people are talking about database performance monitoring, they usually think of inspecting one PostgreSQL database server at a time. While this is certainly useful, it can also be quite beneficial to inspect the status of an entire database cluster, or a set of servers working together, at once. Fortunately, there are easy ways to achieve that with PostgreSQL. This post outlines how it works.

pg_stat_statements: The best tool to monitor PostgreSQL performance

If you want to take a deep look at PostgreSQL performance, there is really no way around pg_stat_statements. It offers a lot of information and is really easy to use.

To install pg_stat_statements, the following steps are necessary:

  • run “CREATE EXTENSION pg_stat_statements” in your desired database
  • add the following line to postgresql.conf:
    • shared_preload_libraries = ‘pg_stat_statements’
  • restart PostgreSQL

Once this is done, PostgreSQL will already be busy collecting data on your database hosts. However, how can we create a “clusterwide pg_stat_statements” view so that we can inspect an entire set of servers at once?
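
On a single host, the kind of per-statement summary we will later want to see cluster-wide looks roughly like this (a sketch; total_time is reported in milliseconds):

SELECT query, calls, total_time, rows
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 5;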

Using pg_stat_statements to check an entire database cluster

Our goal is to show data from a list of servers in a single view. One way to do that is to make use of PostgreSQL’s foreign data wrapper infrastructure. We can simply connect to all servers in the cluster and unify the data in a single view.

Let us assume we have 3 servers, a local machine, “a_server”, and “b_server”. Let us get started by connecting to the local server to run the following commands:

CREATE USER dbmonitoring LOGIN PASSWORD 'abcd' SUPERUSER;
GRANT USAGE ON SCHEMA pg_catalog TO dbmonitoring;
GRANT ALL ON pg_stat_statements TO dbmonitoring;

In the first step I created a simple user to do the database monitoring. Of course you can handle users and so on differently but it seems like an attractive idea to use a special user for that purpose.

The next command enables the postgres_fdw extension, which is necessary to connect to those remote servers we want to access:

CREATE EXTENSION postgres_fdw;

Then we can already create “foreign servers”. Here is how those servers can be created:

CREATE SERVER pg1 FOREIGN DATA WRAPPER postgres_fdw
       OPTIONS (host 'a_server', dbname 'a');
CREATE SERVER pg2 FOREIGN DATA WRAPPER postgres_fdw
       OPTIONS (host 'b_server', dbname 'b');

Just replace the hostnames and the database names with your data and run those commands. The next step is user mapping: it might easily happen that local users are not present on the remote side, so it is necessary to create some sort of mapping between local and remote users:

CREATE USER MAPPING FOR public
       SERVER pg1
       OPTIONS (user 'postgres', password 'abcd');

CREATE USER MAPPING FOR public
       SERVER pg2
       OPTIONS (user 'postgres', password 'abcd');

In this case we will login as user “postgres”. Now that two servers and the user mappings are ready, we can import the remote schema into a local schema:

CREATE SCHEMA monitoring_a;
IMPORT FOREIGN SCHEMA public
       LIMIT TO (pg_stat_statements)
       FROM SERVER pg1
       INTO monitoring_a;

CREATE SCHEMA monitoring_b;
IMPORT FOREIGN SCHEMA public
       LIMIT TO (pg_stat_statements)
       FROM SERVER pg2
       INTO monitoring_b;

For each server there will be a separate schema. This makes it very easy to drop things again and to handle various incarnations of the same data structure.

Wiring things together

The last thing to do in our main database is to connect those remote tables with our local data. The easiest way to achieve that is to use a simple view:

CREATE VIEW monitoring_performance AS
SELECT 'localhost'::text AS node, *
FROM pg_stat_statements
UNION ALL
SELECT 'server a'::text AS node, *
FROM monitoring_a.pg_stat_statements
UNION ALL
SELECT 'server b'::text AS node, *
FROM monitoring_b.pg_stat_statements;

The view will simply unify all the data and add an additional column at the beginning.

PostgreSQL performance monitoring for clusters

Our system is now ready to use and we can already start to run useful analysis:

SELECT *,
       sum(total_time) OVER () AS cluster_total_time,
       sum(total_time) OVER (PARTITION BY node) AS node_total_time,
       round((100 * total_time / sum(total_time) OVER ())::numeric, 4) AS percentage_total,
       round((100 * total_time / sum(total_time) OVER (PARTITION BY node))::numeric, 4) AS percentage_node
FROM   monitoring_performance
LIMIT 10;

The query will return all the raw data and add some percentage numbers on top of this data.

If you are interested in further information on pg_stat_statements, consider reading the following blog post too: https://www.cybertec-postgresql.com/en/pg_stat_statements-the-way-i-like-it/

The post PostgreSQL clusters: Monitoring performance appeared first on Cybertec.

Sebastian Insausti: A Performance Cheat Sheet for PostgreSQL


Performance is one of the most important and most complex tasks when managing a database. It can be affected by the configuration, the hardware or even the design of the system. By default, PostgreSQL is configured with compatibility and stability in mind, since the performance depends a lot on the hardware and on our system itself. We can have a system with a lot of data being read but the information does not change frequently. Or we can have a system that writes continuously. For this reason, it is impossible to define a default configuration that works for all types of workloads.

In this blog, we will see how one goes about analyzing the workload, or queries, that are running. We shall then review some basic configuration parameters to improve the performance of our PostgreSQL database. As we mentioned, we will see only some of the parameters. The list of PostgreSQL parameters is extensive; we will only touch on some of the key ones. However, one can always consult the official documentation to delve into the parameters and configurations that seem most important or useful in our environment.

EXPLAIN

One of the first steps we can take to understand how to improve the performance of our database is to analyze the queries that are made.

PostgreSQL devises a query plan for each query it receives. To see this plan, we will use EXPLAIN.

The structure of a query plan is a tree of plan nodes. The nodes in the lower level of the tree are scan nodes. They return raw rows from a table. There are different types of scan nodes for different methods of accessing the table. The EXPLAIN output has a line for each node in the plan tree.

world=# EXPLAIN SELECT * FROM city t1,country t2 WHERE id>100 AND t1.population>700000 AND t2.population<7000000;
                               QUERY PLAN                                
--------------------------------------------------------------------------
Nested Loop  (cost=0.00..734.81 rows=50662 width=144)
  ->  Seq Scan on city t1  (cost=0.00..93.19 rows=347 width=31)
        Filter: ((id > 100) AND (population > 700000))
  ->  Materialize  (cost=0.00..8.72 rows=146 width=113)
        ->  Seq Scan on country t2  (cost=0.00..7.99 rows=146 width=113)
              Filter: (population < 7000000)
(6 rows)

This command shows how the tables in our query will be scanned. Let's see what the values we can observe in our EXPLAIN output correspond to.

  • The first parameter shows the operation that the engine is performing on the data in this step.
  • Estimated start-up cost. This is the time spent before the output phase can begin.
  • Estimated total cost. This is stated on the assumption that the plan node is run to completion. In practice, a node's parent node might stop short of reading all available rows.
  • Estimated number of rows output by this plan node. Again, the node is assumed to be run to completion.
  • Estimated average width of rows output by this plan node.

The most critical part of the display is the estimated statement execution cost, which is the planner's guess at how long it will take to run the statement. When comparing how effective one query is against another, we will in practice be comparing their cost values.

It's important to understand that the cost of an upper-level node includes the cost of all its child nodes. It's also important to realize that the cost only reflects things that the planner cares about. In particular, the cost does not consider the time spent transmitting result rows to the client, which could be an important factor in the real elapsed time; but the planner ignores it because it cannot change it by altering the plan.

The costs are measured in arbitrary units determined by the planner's cost parameters. Traditional practice is to measure the costs in units of disk page fetches; that is, seq_page_cost is conventionally set to 1.0 and the other cost parameters are set relative to that.

EXPLAIN ANALYZE

With this option, EXPLAIN executes the query, and then displays the true row counts and true run time accumulated within each plan node, along with the same estimates that a plain EXPLAIN shows.

Let's see an example of the use of this tool.

world=# EXPLAIN ANALYZE SELECT * FROM city t1,country t2 WHERE id>100 AND t1.population>700000 AND t2.population<7000000;
                                                     QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
Nested Loop  (cost=0.00..734.81 rows=50662 width=144) (actual time=0.081..22.066 rows=51100 loops=1)
  ->  Seq Scan on city t1  (cost=0.00..93.19 rows=347 width=31) (actual time=0.069..0.618 rows=350 loops=1)
        Filter: ((id > 100) AND (population > 700000))
        Rows Removed by Filter: 3729
  ->  Materialize  (cost=0.00..8.72 rows=146 width=113) (actual time=0.000..0.011 rows=146 loops=350)
        ->  Seq Scan on country t2  (cost=0.00..7.99 rows=146 width=113) (actual time=0.007..0.058 rows=146 loops=1)
              Filter: (population < 7000000)
              Rows Removed by Filter: 93
Planning time: 0.136 ms
Execution time: 24.627 ms
(10 rows)

If we do not find the reason why our queries take longer than they should, we can check this blog for more information.

VACUUM

The VACUUM process is responsible for several maintenance tasks within the database, one of them recovering storage occupied by dead tuples. In the normal operation of PostgreSQL, tuples that are deleted or obsoleted by an update are not physically removed from their table; they remain present until a VACUUM is performed. Therefore, it is necessary to do the VACUUM periodically, especially in frequently updated tables.

If the VACUUM is taking too much time or resources, it means that we must do it more frequently, so that each operation has less to clean.

In some cases you may need to disable it temporarily, for example when loading data in large quantities.

The VACUUM simply recovers space and makes it available for reuse. This form of the command can operate in parallel with the normal reading and writing of the table, since an exclusive lock is not obtained. However, the additional space is not returned to the operating system (in most cases); it is only available for reuse within the same table.

VACUUM FULL rewrites the entire contents of the table into a new disk file with no wasted space, which allows the unused space to be returned to the operating system. This form is much slower and requires an exclusive lock on each table while it is being processed.

VACUUM ANALYZE performs a VACUUM and then an ANALYZE for each selected table. This is a practical way of combining routine maintenance scripts.

ANALYZE collects statistics on the contents of the tables in the database and stores the results in pg_statistic. Subsequently, the query planner uses these statistics to help determine the most efficient execution plans for queries.
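
As a quick illustration (the table name is hypothetical), a routine maintenance pass and a check of when the table was last vacuumed might look like this:

VACUUM (VERBOSE, ANALYZE) orders;

SELECT relname, last_vacuum, last_autovacuum, last_analyze, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'orders';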

Configuration parameters

To modify these parameters we must edit the file $PGDATA/postgresql.conf. We must bear in mind that some of them require a restart of the database.
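
Most of them can also be changed without editing the file by hand; a sketch (the work_mem value is just an example, and pg_settings.context tells you whether a restart is required):

ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();
SELECT name, setting, unit, context FROM pg_settings WHERE name = 'work_mem';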

max_connections

Determines the maximum number of simultaneous connections to our database. Since some memory resources are configured per client, the maximum number of clients suggests the maximum amount of memory that may be used.

superuser_reserved_connections

If the max_connections limit is reached, these connections are kept reserved for superusers.

shared_buffers

Sets the amount of memory that the database server uses for shared memory buffers. If you have a dedicated database server with 1 GB or more of RAM, a reasonable initial value for shared_buffers is 25% of your system's memory. Larger configurations for shared_buffers generally require a corresponding increase in max_wal_size, to extend the process of writing large amounts of new or modified data over a longer period of time.

temp_buffers

Sets the maximum number of temporary buffers used for each session. These are local session buffers used only to access temporary tables. A session will assign the temporary buffers as needed up to the limit given by temp_buffers.

work_mem

Specifies the amount of memory that will be used by internal ORDER BY, DISTINCT, JOIN, and hash table operations before writing to temporary files on disk. When configuring this value we must take into account that several sessions may be executing these operations at the same time, and each operation will be allowed to use this much memory before it starts to write data to temporary files.

This option was called sort_mem in older versions of PostgreSQL.

maintenance_work_mem

Specifies the maximum amount of memory that maintenance operations will use, such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY. Since only one of these operations can be executed at a time by a session, and an installation usually does not have many of them running simultaneously, it can safely be larger than work_mem. Larger settings can improve performance for VACUUM and database restores.

When autovacuum runs, this memory may be allocated up to autovacuum_max_workers times, so we must take this into account, or otherwise configure the autovacuum_work_mem parameter to manage it separately.

fsync

If fsync is enabled, PostgreSQL will try to make sure that the updates are physically written to the disk. This ensures that the database cluster can be recovered to a consistent state after an operating system or hardware crash.

While disabling fsync generally improves performance, it can cause data loss in the event of a power failure or a system crash. Therefore, it is only advisable to deactivate fsync if you can easily recreate your entire database from external data.

checkpoint_segments (PostgreSQL < 9.5)

The maximum number of WAL file segments between automatic checkpoints (each segment is normally 16 megabytes). Increasing this parameter can increase the amount of time needed for crash recovery. In a system with a lot of traffic, setting it to a very low value can hurt performance. It is recommended to increase the value of checkpoint_segments on systems with many data modifications.

Also, a good practice is to save the WAL files on a disk other than PGDATA. This is useful both for balancing the writing and for security in case of hardware failure.

As of PostgreSQL 9.5, the configuration variable "checkpoint_segments" was removed and replaced by "max_wal_size" and "min_wal_size".

max_wal_size (PostgreSQL >= 9.5)

The maximum size the WAL is allowed to grow to between checkpoints. The size of WAL can exceed max_wal_size in special circumstances. Increasing this parameter can increase the amount of time needed for crash recovery.

min_wal_size (PostgreSQL >= 9.5)

As long as WAL disk usage stays below this value, old WAL files are recycled for future use at a checkpoint rather than removed. This can be used to ensure that enough WAL space is reserved to handle spikes in WAL usage, for example when running large batch jobs.

wal_sync_method

Method used to force WAL updates to the disk. If fsync is disabled, this setting has no effect.

wal_buffers

The amount of shared memory used for WAL data that has not yet been written to disk. The default setting is about 3% of shared_buffers, not less than 64KB or more than the size of a WAL segment (usually 16MB). Setting this value to at least a few MB can improve write performance on a server with many concurrent transactions.

effective_cache_size

This value tells the query planner how much memory is likely to be available for caching data, and so whether data accessed by a plan is likely to fit in memory. It is taken into account in the cost estimates of using an index; a high value makes it more likely that index scans are used, and a low value makes it more likely that sequential scans will be used. A reasonable value would be 50% of the RAM.

default_statistics_target

PostgreSQL collects statistics from each of the tables in its database to decide how queries will be executed on them. By default, it does not collect too much information, and if you are not getting good execution plans, you should increase this value and then run ANALYZE in the database again (or wait for the AUTOVACUUM).

synchronous_commit

Specifies whether the transaction commit will wait for the WAL records to be written to disk before the command returns a "success" indication to the client. The possible values are: "on", "remote_apply", "remote_write", "local" and "off". The default setting is "on". When it is disabled, there may be a delay between the time the client receives the success report and when the transaction is actually guaranteed to be safe against a server crash. Unlike fsync, disabling this parameter does not create any risk of database inconsistency: a crash of the operating system or database may result in the loss of some recent, supposedly committed transactions, but the state of the database will be exactly the same as if those transactions had been aborted cleanly. Therefore, deactivating synchronous_commit can be a useful alternative when performance is more important than exact certainty about the durability of a transaction.

Logging

There are several kinds of information worth logging. Let's look at some of them; a sample configuration snippet follows the list:

  • log_min_error_statement: Sets the minimum logging level.
  • log_min_duration_statement: Used to record slow queries in the system.
  • log_line_prefix: Adds information at the beginning of each log line.
  • log_statement: You can choose between NONE, DDL, MOD, ALL. Using "all" can cause performance problems.
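
As mentioned above, a sample postgresql.conf snippet for these settings might look like this (the values are illustrative, not recommendations):

log_min_duration_statement = 500        # log statements that run longer than 500 ms
log_line_prefix = '%t [%p] %u@%d '      # timestamp, process ID, user@database
log_statement = 'ddl'                   # log schema changes only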

Design

In many cases, the design of our database can affect performance. We must be careful in our design, normalizing our schema and avoiding redundant data. In many cases it is convenient to have several small tables instead of one huge table. But as we said before, everything depends on our system and there is not a single possible solution.

We must also use indexes responsibly. We should not create indexes for every field or combination of fields, since, although we no longer have to scan the entire table, we use disk space and add overhead to write operations.

Another very useful tool is a connection pooler. If we have a system with a lot of load, we can use it to avoid saturating the connections to the database and to be able to reuse them.
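
A common choice here is PgBouncer; a minimal configuration sketch (all names, paths and sizes are illustrative):

[databases]
world = host=127.0.0.1 port=5432 dbname=world

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 500
default_pool_size = 20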

Hardware

As we mentioned at the beginning of this blog, hardware is one of the important factors that directly affect the performance of our database. Let's see some points to keep in mind.

  • Memory: The more RAM we have, the more memory data we can handle, and that means better performance. The speed of writing and reading on disk is much slower than in memory, therefore, the more information we can have in memory, the better performance we will have.
  • CPU: Maybe it does not make much sense to say this, but the more CPU we have, the better. In any case it is not the most important in terms of hardware, but if we can have a good CPU, our processing capacity will improve and that directly impacts our database.
  • Hard disk: There are several types of disks we can use: SCSI, SATA, SAS, IDE. We also have solid state disks. We must weigh quality/price and speed when choosing which to use. But the type of disk is not the only thing to consider; we must also look at how to configure the disks. If we want good performance, we can use RAID10, keeping the WALs on another disk outside the RAID. It is not recommended to use RAID5, since the performance of that RAID level for databases is not good.

Conclusion

After taking into account the points mentioned in this blog, we can perform a benchmark to verify the behavior of the database.

It is also important to have our database monitored to determine if we are facing a performance problem, so we can solve it as soon as possible. For this task there are several tools, such as Nagios, ClusterControl or Zabbix, among others, that allow us not only to monitor, but in some cases also to take proactive action before a problem occurs. With ClusterControl, in addition to monitoring, administration and several other utilities, we can receive recommendations on what actions to take when we receive performance alerts. This gives us an idea of how to solve potential problems.

This blog is not intended to be an exhaustive guide to improving database performance. Hopefully, it gives a clearer picture of which things can become important and some of the basic parameters that can be configured. Do not hesitate to let us know if we've missed any important ones.

Konstantin Evteev: Recovery use cases for Logical Replication in PostgreSQL 10


Hi. My name is Konstantin Evteev, and I'm a DBA Unit Leader at Avito.ru, one of the world's top classifieds sites. In Avito, ads are stored in PostgreSQL databases, and logical replication has been actively used for many years. It helps us solve problems such as the growth of data volume and of the number of requests to it, scaling and load distribution, data delivery to the DWH and to search subsystems, and inter-database and inter-system data synchronization. But nothing comes for free: the result is a complex distributed system, and hardware failures do happen, so you need to be ready for them at all times. There are plenty of examples of logical replication configuration and lots of success stories about using it, but with all this documentation there is almost nothing about recovering after crashes and data corruption, and there are no ready-made tools for it. Over the years of constantly using PgQ replication, we have gained extensive experience and implemented our own add-ons and extensions to restore and synchronize data after crashes in distributed data systems.

In this report, we would like to show how our recovery use cases around Londiste (and PgQ in general) in distributed data processing could be moved to the new logical replication in PostgreSQL 10. This research was done by Mikhail Tyurin (tmihail@bk.ru), Sergey Burladyan (eshkinkot@gmail.com) and me (konst583@gmail.com). Many thanks to Stas Kelvich from Postgres Professional for useful comments, discussions and review. We started this work when the PostgreSQL 10 alpha was released and are still working on it.

I want to highlight one aspect of infrastructure: streaming and logical replication types are set up in asynchronous mode. Synchronous replication may not cope with a huge OLTP load.

Logical replication is a common case of data denormalization. We usually store data in normal form in one place, and then we need to redistribute it to a different place, sometimes into a different structure (with some data transformations), for various reasons:

  • storing requirements (optimal hardware utilization);
  • making easier data processing;
  • access management;
  • etc.

The typical use-cases for logical replication are:

  • giving an access to the replicated data to different groups of users;
  • storage flexibility;
  • event tracking;
  • scaling;
  • data distribution;
  • flexible replication chains;
  • data transformation;
  • upgrading between major versions of PostgreSQL.

Different logical replication solutions, tools and frameworks from the community allow us to build our own improvements and tools. We replicate data from one Postgres instance to another for different use cases. In Avito we successfully use logical replication for:

  • Dictionary delivery.
  • Load balancing. Copies of the data are stored in different places, so we can read them in parallel.
  • Partial replication to services. Delivery of part of a table, for example a category or a user group, to a particular service or application.
  • Data streaming to search systems. In our case it is data delivery to an external index.
  • Persistent queue. Our own implementation of a central store of all states of our tables/objects at the end of each transaction, serialized on the primary key lock. The main consumer of this persistent queue is the DWH.
  • Interservice communication.

That is a brief overview of the value of logical replication. In PostgreSQL 10 logical replication became a built-in feature, and Avito's architecture is a successful example of using a standalone implementation of logical replication (SkyTools, Londiste, PgQ).

Architecture

Londiste implementation

Sources:

The infrastructure for Londiste is as follows: Provider, Subscriber, Ticker, and the Londiste Worker.

In Londiste the replication queue is a set of tables, and changes are written to them with the help of triggers. At the same time, the ticker writes down snapshots on the provider. The Londiste worker then fetches changes with a query of the form "give me the changes that were not visible in the previous snapshot and became visible in the current snapshot" and applies them on the subscriber. That is Londiste's architecture in brief. The main thing I want to highlight is that the replication queue is stored in tables.

Builtin logical replication

Builtin logical replication is built with an architecture similar to physical streaming replication.

It is implemented by “walsender” and “apply” processes. The walsender process starts logical decoding of the WAL and loads the standard logical decoding plugin (pgoutput). The plugin transforms the changes read from WAL to the logical replication protocol and filters the data according to the publication specification. The data is then continuously transferred using the streaming replication protocol to the apply worker, which maps the data to local tables and applies the individual changes as they are received, in correct transactional order.

The apply process on the subscriber database always runs with session_replication_role set to replica, which produces the usual effects on triggers and constraints. The logical replication apply process currently only fires row triggers, not statement triggers. The initial table synchronization, however, is implemented like a COPY command and thus fires both row and statement triggers for INSERT. For our Undo case we use row trigger, you will see it later.

The main difference between trigger-based and built-in replication is the queue: in built-in replication it is the WAL, while in a trigger-based solution it is a table.

Before describing our recovery use cases, I want to draw your attention to a few details of the built-in logical replication implementation that make our recovery use cases easier to understand.

Replication progress

Replication origins are used for tracking replication progress in PostgreSQL 10. Replication progress is tracked in a shared memory table (ReplicationState) that is dumped to disk at every checkpoint, and the progress of applying logical replication is also written to the WAL (e.g. COMMIT 2017-XX-XX XX::XXX::XX origin: node 1, lsn 0/30369C0, at 2017–11–01 XXXXXX msk).

Replication progress in shared memory is not transactional: even inside a Repeatable Read transaction, the value in pg_replication_origin_status can change. This means that today we cannot make a consistent dump of a logical subscription's database without stopping logical replication. But we can stop the replication before taking the snapshot, log the LSN to some external place, run the dump, and then start replication again. Afterwards we switch the new logical consumer to a new slot and set its LSN to the subscription's LSN value that we logged before starting the dump.
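For example (a hedged sketch; the subscription name, origin name and LSN are hypothetical):

-- visible but not transactional: re-running this inside a REPEATABLE READ
-- transaction can still show a different remote_lsn
SELECT external_id, remote_lsn, local_lsn FROM pg_replication_origin_status;

ALTER SUBSCRIPTION sub1 DISABLE;   -- stop replication before taking the snapshot
-- ... note remote_lsn, run the dump, then resume replication ...
ALTER SUBSCRIPTION sub1 ENABLE;

-- later, point the new consumer's origin at the logged position
SELECT pg_replication_origin_advance('pg_16389', '0/30369C0'::pg_lsn);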

pg_subscription and pg_dump

When dumping logical replication subscriptions, pg_dump generates CREATE SUBSCRIPTION commands that use the NOCONNECT option (connect = false). After restoring from such a dump, the replicated tables are not linked to the subscription. To create that link we need to refresh the subscription, but that is not allowed for a disabled subscription (ERROR: ALTER SUBSCRIPTION ... REFRESH is not allowed for disabled subscriptions), i.e. the subscription is restored with connect = false, whereas we need connect = true and enabled = false. And if we simply turn the replication on, there is a time window before the REFRESH PUBLICATION command is executed; since there is no link between the replicated tables and the subscription during that window, the changes arriving in it will not be applied on the subscriber, which is a silent data corruption. That is why it is not possible to connect a new subscription created by pg_dump and then execute REFRESH PUBLICATION. To create a new copy of a subscriber you should do the following:

  • run pg_dump without subscriptions;
  • create the subscription manually with a link to an existing slot, after restoring the dump;
  • before starting the replication, switch the slot to the logical subscription's LSN value that we logged before starting the dump.

Let’s see how to move our experience on restoring and synchronizing data after crashes in distributed data processing systems from SkyTools to built-in logical replication in PostgreSQL 10.

Avito recovery use cases

Reinitializing subscriber from another subscriber

Reinitializing in brief

1. The infrastructure for this use case is as follows:

  • Provider (main) — is a publisher — the source of our logical replication.
  • Two replicas (repca) — logical consumers: repca1, repca2.

There are two replicas for recovery purposes. Crashes can happen, so if you want fast recovery, reserve every node.

This case is suitable for systems where:

  • the provider is overloaded;
  • the size of the replicated tables is extremely big and there is some logic in triggers on these tables on the subscriber's side, i.e. the initial data copy would take too long, e.g. a few days;
  • the replica is a derivative of the original data, for example signals about events on the source side.

Reinitializing subscriber from another subscriber in brief:

  • “Resubscribe” — create a new logical slot without a consumer
  • Wait till the new subscriber's position is seen on the old one and copy (pg_dump -j)
  • Change the queue position according to destination
  • Start replication

In short, we have two logical replicas. We recreated a logical subscriber with no impact on the provider, which is the main feature of our solution. We did it with Londiste, and now let's see how to do the same with logical replication in PostgreSQL 10 (a consolidated sketch follows the step-by-step list below).

Full reinitializing implementation

  1. Creating a logical slot for a new subscriber:

2. Disabling active subscription:

3. Logging current LSN:

4. pg_dump:

5. At the same time:

6. Creating subscription.

7. Moving LSN.

8. Checking the state of subscription:

9. Enabling subscription:

UNDO recovery on the destination side

UNDO in brief

1. The infrastructure for this use case is as follows:

  • Provider (main) — is a publisher — the source of our logical replication.
  • Provider’s standby
  • Replica (repca1) — logical consumer.

There is a master with its standby. If the master crashes, there is a chance that the repca (logical consumer) will be ahead of (in the future relative to) the provider's standby, e.g. because logical replication works faster than binary replication when the standby is overloaded, or when queries on the standby block the application of new WAL, etc. For this case Avito has developed the Undo framework. Undo looks like an audit log of events that were applied on the logical consumer, containing the opposite actions: for a Delete it is an Insert, for an Update it is another Update, for an Insert it is a Delete. After the standby is promoted, we take the progress of logical replication on the standby and compare it to the subscriber's position, then apply Undo on the subscriber for all actions beyond the standby's position. As a result, the provider's and subscriber's data will be in a consistent state, and from this point on it is safe to turn logical replication back on.

If we do not bring the provider and subscriber databases into a consistent state after the crash described above, replication may break with a unique constraint violation error. Worse, there is a chance that everything will look fine while the data is actually inconsistent, a silent data corruption that is very hard to detect and even harder to explain and reproduce.

Here are schema images with the main steps for Undo recovery case:

Step 1: provider’s crash

Step 2: promoting provider’s standby

Step 3: applying Undo

Step 4: Starting Logical Replication

The full implementation of Undo can be found on GitHub. I just want to highlight the main components here (a hedged sketch of the undo trigger follows this list):

  1. Undo log table.
select
    id, LSN, dst_schema, dst_table, undo_cmd, cmd_data, cmd_pk
from
    undo_log order by id

2. Undo trigger that writes opposite actions to Undo log table: for Delete it is Insert, for Update it is another Update, for Insert it is Delete.

3. Undo apply function — for applying Undo in descending order.
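As a rough illustration only (not the Avito code from GitHub), such an undo trigger might look like the sketch below. The table name and the primary key column are assumptions, and the lsn column of undo_log is left out; the real framework also records the position needed to know how far back to undo:

CREATE OR REPLACE FUNCTION undo_capture() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        -- the opposite of an INSERT is a DELETE by primary key
        INSERT INTO undo_log (dst_schema, dst_table, undo_cmd, cmd_pk)
        VALUES (TG_TABLE_SCHEMA, TG_TABLE_NAME, 'DELETE', NEW.id::text);
    ELSIF TG_OP = 'UPDATE' THEN
        -- the opposite of an UPDATE is an UPDATE back to the old row image
        INSERT INTO undo_log (dst_schema, dst_table, undo_cmd, cmd_data, cmd_pk)
        VALUES (TG_TABLE_SCHEMA, TG_TABLE_NAME, 'UPDATE', to_jsonb(OLD)::text, OLD.id::text);
    ELSE
        -- the opposite of a DELETE is an INSERT of the old row image
        INSERT INTO undo_log (dst_schema, dst_table, undo_cmd, cmd_data, cmd_pk)
        VALUES (TG_TABLE_SCHEMA, TG_TABLE_NAME, 'INSERT', to_jsonb(OLD)::text, OLD.id::text);
    END IF;
    RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER undo_capture
    AFTER INSERT OR UPDATE OR DELETE ON replicated_table
    FOR EACH ROW EXECUTE PROCEDURE undo_capture();

-- the apply worker runs with session_replication_role = replica,
-- so the trigger must be enabled for that mode:
ALTER TABLE replicated_table ENABLE REPLICA TRIGGER undo_capture;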

Key features and specialties for Undo implementation in PostgreSQL 10:

  • In trigger-based replication solutions like Londiste, tracking the progress of logical replication for the Undo recovery case is very easy compared with the built-in logical replication. With the PostgreSQL 10 implementation you should wait until the provider's standby has replayed all available WAL, then log the LSN position, which you can get by executing the pg_last_wal_replay_lsn() function (called pg_last_xlog_replay_location() before version 10). While recovery is in progress the LSN increases monotonically; once recovery has completed it stays at the value of the last WAL record applied during that recovery. When the server has been started normally without recovery, the function returns NULL. We log this LSN, promote the provider's standby, and then apply Undo on the subscriber down to this state.
  • Unlike a physical slot, a logical slot cannot be created on a standby. So you must not allow client activity on the promoted standby before creating a logical slot: all data changes need to be preserved for logical replication, and if you do not create the slot first there will be a gap in your data.
  • To prevent undoing new changes, writing transactions on the subscriber can be turned on only after applying Undo.
  • Also you can read from the subscriber before applying Undo, but it is very risky — you can get inconsistent data.

Full implementation of Undo case — provider's crash and switching to the provider's standby (a consolidated sketch follows the steps):

  1. Write down the WAL replay LSN before promotion.

2. Logical Replication Slot isn’t replicated to the standby, that’s why you shouldn’t turn on traffic immediately after promotion of standby.

3. There are some changes for Undo.

4. Current subscriber’s LSN.

5. Applying Undo.

6. Enabling logical replication.
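A hedged sketch of that sequence; the undo_apply call is only a hypothetical name for the framework's apply function, and the LSN is made up:

-- 1. On the provider's standby, before promotion: record the replay position.
SELECT pg_last_wal_replay_lsn();

-- 2. Promote the standby, then recreate the logical slot before any client
--    writes are allowed (logical slots are not replicated to standbys).
SELECT pg_create_logical_replication_slot('repca1_slot', 'pgoutput');

-- 3./4. On the subscriber: compare its replication origin position with the
--       LSN recorded in step 1.
SELECT remote_lsn FROM pg_replication_origin_status;

-- 5. Apply Undo back down to the recorded LSN (illustrative call only).
-- SELECT undo_apply('0/30369C0'::pg_lsn);

-- 6. Re-enable logical replication.
ALTER SUBSCRIPTION sub1 ENABLE;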

Few sources, one subscription and Undo

  • In the trigger, get the subscriber's name and write it to the undo log together with the opposite actions.
select subname
from
    pg_stat_subscription p
where
    p.pid = pg_backend_pid()
  • On the publication side there is no way to find out who consumes a slot. Since we do not know the links between subscriber and publisher, we need to maintain an external list of logical consumers (we do the same thing for Londiste) in order to apply Undo if the source crashes.
  • So the consumer name has to follow a special format; this makes it possible to work out the link between publication and subscription.

REDO — reposition source (subscriber’s crash)

REDO in brief

1. The infrastructure for this use case is as follows:

  • Provider(main) — is a publisher — the source of our logical replication.
  • Subscriber’s master (repca1) — logical consumer.
  • Subscriber’s standby (repca2) — logical consumer’s standby.

There are 2 kinds of replication:

  • logical replication between the provider (the main one which is the source of the logical replication) and subscriber’s master (repca1);
  • streaming replication between subscriber’s master (repca1) and subscriber’s standby (repca2).

If the subscriber's master crashes, there is a chance that the subscriber's standby will fall behind the crashed master; this can happen when the replication is set up in asynchronous mode. In this case the logical slot's position is not at the LSN that the subscriber's standby expects, i.e. we need a different logical slot (call it a failover slot) that tracks the progress of applying the logical changes on the subscriber's standby.

Now let’s look at the main schema of that case on the images below:

Step 1: the crash of subscriber's master

Step 2: promoting subscriber's standby

Step 3: changing the queue position according to destination

Step 4: starting the replay of the logical replication

Full implementation of Redo case — subscriber's crash and promoting subscriber's standby, with more examples of data spreading (the key commands are sketched after the steps)

The command for moving a replication slot follows. We create the slot manually for the promoted consumer's standby to prevent rotation of the replicated queue (WAL) on the provider's side:

psql -p 5433 -U postgres -X -d src \
-c "
select * from pg_logical_slot_get_binary_changes(
    'repca2'::name,
    '0/38AFCC0'::pg_lsn,
    null::int,
    variadic array['proto_version', '1', 'publication_names', 'pub']
)"
  1. Creating a logical slot to prevent WAL rotation on the provider's side (these WAL files may be needed by the promoted subscriber's standby).

2. Adding new changes for our logical consumer:

3. Replication slot for our subscriber’s standby is in the past:

4. Checking pg_replication_origin status on subscriber’s side and subscriber’s standby side:

5. “Moving” replication slot with the help of SQL protocol

6. LSN for both replication slots are equal.

7. “Emulating delay of subscriber’s standby replication”.

8. Adding one more record in the replicated table.

9. As expected subscriber’s standby falls behind.

10. “Subscriber’s crash”. Subscriber’s standby is still behind.

11. Dropping slot which was consumed by crashed subscriber’s primary.

12. Checking pg_replication_origin_status.

13. Sync the "subscriber's standby slot" with the subscriber's standby pg_replication_origin.

14. Subscriber hasn’t had new changes yet.

15. Promoting subscriber’s standby.

16. Promoted subscriber's standby is still behind.

17. Alter subscription: set the actual slot we prepared previously.

18. The slot is turned on and is now being consumed by the subscriber.

19. Subscriber replayed the changes.
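A hedged sketch of the key commands in that sequence, reusing the names from the example above (the LSN is illustrative):

-- step 11, on the provider: drop the slot of the crashed subscriber's master
SELECT pg_drop_replication_slot('repca1');

-- step 13, on the provider: move the reserved slot up to the position reported
-- by the promoted standby's replication origin
SELECT * FROM pg_logical_slot_get_binary_changes(
    'repca2'::name, '0/3037B68'::pg_lsn, null::int,
    variadic array['proto_version', '1', 'publication_names', 'pub']);

-- steps 17/18, on the promoted subscriber's standby: switch to the reserved slot
ALTER SUBSCRIPTION sub SET (slot_name = 'repca2');
ALTER SUBSCRIPTION sub ENABLE;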

The algorithm for consuming a reserved slot

  1. Take the LSN position for the subscriber’s master’s slot from pg_replication_slots.

2. Check pg_replication_origin on the subscriber’s master.

3. Wait until the subscriber’s master’s pg_replication_origin is seen on the subscriber’s standby side.

4. Consume the reserved slot (subscriber’s standby slot):

select * from pg_logical_slot_get_binary_changes('repca2'::name,
    '0/3037B68'::pg_lsn, null::int, variadic array['proto_version', '1', 'publication_names', 'pub'])

REDO 2 — on provider’s side (provider’s crash and switching to the provider’s standby, subscriber is falling behind)

1. The infrastructure for this use case is as follows:

  • Provider (main) is a publisher — the source of our logical replication.
  • Provider’s standby.
  • Replica (repca1) logical consumer.

There is a master with its standby (the source of logical replication). If the primary crashes, there is a chance that the repca (logical consumer) will fall behind the provider, since logical replication might work slower than binary replication for many reasons:

  • by design, many backends can make data changes on the provider side while only one backend replays all of those changes on the subscriber side,
  • subscriber might be overloaded,
  • there can be queries on the subscriber that lock some replicated relations and prevent applying the logical changes,
  • etc.

In a trigger-based solution (Londiste in my case) we do not need anything special for this case: we just promote the standby, make some changes in the config file, and continue replaying the logical replication. This is possible because the replication queue and the replication progress are stored in tables (PgQ and Londiste tables), which are also replicated to the provider's binary standby. In contrast, the logical replication in PostgreSQL 10 uses WAL files, pg_catalog and a logical replication slot for these purposes.

Let me give you a simple example of potential data loss in a production system. Imagine a system where a table with orders is replicated. On the subscriber's side there is a trigger which fires on every change to the orders table and writes each row state to an audit table. Such solutions are used for audit, analytics and other purposes. If a crash like the one described above happens, then even if you have all the WAL files on your provider's standby, you cannot get the logical changes for the subscriber, so there is a gap in your data.
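A minimal sketch of such an audit setup on the subscriber; all names are hypothetical, and since the apply worker runs with session_replication_role = replica, the trigger has to be enabled for that mode:

CREATE TABLE orders_audit (
    audit_id   bigserial   PRIMARY KEY,
    changed_at timestamptz NOT NULL DEFAULT now(),
    op         text        NOT NULL,
    order_row  jsonb       NOT NULL
);

CREATE OR REPLACE FUNCTION orders_audit_trg() RETURNS trigger AS $$
BEGIN
    INSERT INTO orders_audit (op, order_row)
    VALUES (TG_OP, to_jsonb(COALESCE(NEW, OLD)));
    RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_audit
    AFTER INSERT OR UPDATE OR DELETE ON orders
    FOR EACH ROW EXECUTE PROCEDURE orders_audit_trg();

ALTER TABLE orders ENABLE ALWAYS TRIGGER orders_audit;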

In the PostgreSQL 10 implementation of logical replication, a logical slot is not replicated to the provider's standby. We cannot create a logical replication slot on the provider's standby, and today there is no other way to continue replication on the provider's side after a failover without data loss.

A range of large technology companies in Russia face the same problem:

  • Avito, the biggest classified site of Russia,
  • Yandex, the largest technology company in Russia (the same problem was mentioned in the talk https://www.youtube.com/watch?v=2QJgCTF1cIE),
  • a database architect from Tinkoff bank, one of the largest Russian banks.

This problem is very important, and I see community activity in developing a solution for it, e.g.

Conclusions

The short list above compares the tools we use to recover and make data consistent after crashes in distributed systems built with the help of logical replication.

My thoughts about desirable features

I would be on cloud nine if there were an API and commands to implement all the recovery cases we discussed above as simply as setting up logical replication in PostgreSQL 10 :)

(1) Reinitializing subscriber from another subscriber

  • Make pg_replication_origin “transactional” ~ “pin to snapshot”, to avoid having to stop the logical replication process in the "reinitializing subscriber from another subscriber" case.
  • Let pg_dump dump the replication origin (or add an option to enable this); it would make it easier to implement the "reinitializing subscriber from another subscriber" case.

(2) UNDO recovery on the destination side

  • Implement “logical UNDO” / or SQL API

(3) REDO reposition source (subscriber’s crash)

  • Tracking the progress of consuming a logical slot on the subscriber (“provider_restart_lsn”) to make it easier to implement the Redo case (the algorithm for consuming a reserved slot).
  • Comparing LSNs between subscriber and provider (as SerialConsumer does): raise an error when the provider state does not match the subscriber, to prevent silent data corruption like the cases described above in Undo and Redo 1 and 2.
  • A function to move a slot forward on the provider (not to fetch the changes, just to move the position; we are waiting for pg_replication_slot_advance(slot_name name, upto_lsn pg_lsn) in PostgreSQL 11).

(4) REDO 2 — on provider’s side (provider’s crash and switching to the provider’s standby, subscriber is falling behind)

  • Any implementation of failover slot or another mechanism to get logical changes for the subscriber after provider’s failover. It’s needed for two cases:

— In Redo 2 on the provider's side (subscriber in the past), the absence of a failover slot blocks production use of logical replication in PostgreSQL 10, at least in our case.

— In Undo: to prevent losing new changes for the logical subscriber, we need to create a logical slot before letting clients connect to the promoted provider. This can be handled today by following the steps described above.

Event tracking

  • Implementing a transactional queue (like PgQ) can be done by using a proxy table, or by writing a custom logical decoder and using pg_logical_emit_message.

My colleagues and I hope to see solutions for these in future PostgreSQL versions. Logical replication in PostgreSQL 10 is a great achievement; it is very simple today to set up logical replication with high performance. This article was written to highlight the main demands of the Russian PostgreSQL community (and I suppose members of the PostgreSQL community worldwide face the same problems).


Recovery use cases for Logical Replication in PostgreSQL 10 was originally published in AvitoTech on Medium, where people are continuing the conversation by highlighting and responding to this story.

Regina Obe: PostgresVision 2018 Slides and Impressions


Leo and I attended PostgresVision 2018 which ended a couple of days ago.

We gave a talk on spatial extensions with the main focus being PostGIS. Here are links to our slides PostgresVision2018_SpatialExtensions: HTML version, PDF.

Unfortunately there are no slides of the pgRouting part, except the one that says PGRouting Live Demos because Leo will only do live demos. He has no fear of his demos not working.

Side note: if you are on Windows and use the PostGIS bundle, all the extensions listed in the PostGIS box of the spatial extensions diagram, as well as pointcloud, pgRouting, and ogr_fdw, are included in the bundle.

Continue reading "PostgresVision 2018 Slides and Impressions"

Regina Obe: Unpivoting data using JSON functions


Most of our use cases for the built-in JSON support in PostgreSQL are not about implementing schemaless storage, but about remolding data. Remolding can take the form of restructuring data into JSON documents suitable for web maps, JavaScript charting web apps, or datagrids. It also has uses beyond just outputting data in JSON form: the functions are useful for unraveling JSON data back into a more meaningful relational form.

One of the common cases where we use JSON support is what we call UNPIVOTING data. We demonstrated this in our Postgres Vision 2018 presentation on slide 23. This trick won't work in other relational databases that support JSON, because it also relies on a long-standing PostgreSQL feature: the ability to treat a row as a data field.
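The idea, roughly, is to turn each row into a JSON document and then explode it into key/value pairs. A minimal sketch, with a hypothetical measurements table, might be:

SELECT m.id, j.key AS measure, j.value AS amount
FROM measurements AS m
     CROSS JOIN LATERAL jsonb_each_text(to_jsonb(m) - 'id') AS j(key, value);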

Continue reading "Unpivoting data using JSON functions"

REGINA OBE: Coming PostGIS 2.5 ST_OrientedEnvelope


Joshua Tolley: Systematic Query Building with Common Table Expressions


The first time I got paid for doing PostgreSQL work on the side, I spent most of the proceeds on the mortgage (boring, I know), but I did get myself one little treat: a boxed set of DVDs from a favorite old television show. They became part of my evening ritual, watching an episode while cleaning the kitchen before bed. The show features three military draftees, one of whom, Frank, is universally disliked. In one episode, we learn that Frank has been unexpectedly transferred away, leaving his two roommates the unenviable responsibility of collecting Frank’s belongings and sending them to his new assignment. After some grumbling, they settle into the job, and one of them picks a pair of shorts off the clothesline, saying, “One pair of shorts, perfect condition: mine,” and he throws the shorts onto his own bed. Picking up another pair, he says, “One pair of shorts. Holes, buttons missing: Frank’s.”

The other starts on the socks: “One pair of socks, perfect condition: mine. One pair socks, holes: Frank’s. You know, this is going to be a lot easier than I thought.”

“A matter of having a system,” responds the first.

I find most things go better when I have a system, as a recent query writing task made clear. It involved data from the Instituto Nacional de Estadística y Geografía, or INEGI, an organization of the Mexican government tasked with collecting and managing country-wide statistics and geographical information. The data set contained the geographic outline of each city block in Mexico City, along with demographic and statistical data for each block: total population, a numeric score representing average educational level, how much of the block had sidewalks and landscaping, whether the homes had access to the municipal sewer and water systems, etc. We wanted to display the data on a Liquid Galaxy in some meaningful way, so I loaded it all in a PostGIS database and built a simple visualization showing each city block as a polygon extruded from the earth, with the height and color of the polygon proportional to the educational score for that block, compared to the city-wide average.

It wasn’t entirely a surprise that with so many polygons, rendering performance suffered a bit, but most of all the display was just plain confusing. This image is just one of Mexico City’s 16 boroughs.

With so much going on in the image, it’s difficult for the user to extract any meaningful information. So I turned to a technique we’d used in the past: reprocess the geographical area into grid squares, extrapolate the statistic of interest over the area of the square, and plot it again as a set of squares. The result is essentially a three dimensional heat map, much easier to comprehend, and, incidentally, to render.

As with most programming tasks, it’s helpful to have a system, so I started by sketching out how exactly to produce the desired result. I planned to overlay the features in the data set with a grid, and then for each square in the grid, find all intersecting city blocks, a number representing the educational level of residents of that block, and what percentage of the block’s total area intersects each grid square. From that information I can extrapolate that block’s contribution to the grid square’s total educational score. The precise derivation of the score isn’t important for our purposes here; suffice it to say it’s a numeric value with no particular associated unit, whose value lies in its relation to the scores of other blocks. Residents of a block with a high score are, on average, probably more educated than residents of a lower-scoring block. For this query, a block with an average educational level of, say, 100 “points”, would contribute all 100 points to a grid square if the entire block lay within that square, 60 points if only 60% of it was within the square, and so on. In the end, I should be able to add up all the scores for each grid square, rank them against all other grid squares, and produce a visualization.

I suffer from the decidedly masochistic habit of doing whatever I can in a single query, while maintaining a desire for readable and maintainable code. Cramming everything into one query isn’t always a good technique, as I hope to illustrate in a future blog post, but it worked well enough in this instance, and provides a good example I wanted to share, of one way to use Common Table Expressions. They do for SQL what subroutines do for other languages, separating tasks into distinct units. A common table expression looks like this:

WITH alias AS (
    SELECT something FROM wherever
),
another_alias AS (
    SELECT another_thing FROM alias LEFT JOIN something_else
)
SELECT an, assortment, of, fields
FROM another_alias;

As shown by this example query, the WITH keyword precedes a list of named, parenthesized queries, each of which functions throughout the life of the query as though it were a full-fledged table. These pseudo-tables are called Common Table Expressions, and they allow me to make one table for each distinct function in what will prove to be a fairly complicated query.

Let’s go through the elements of our new query piece by piece. First, I want to store the results of the query, so I can create different visualizations without recalculating everything. So I’ll make this query create a new table filled with its results. In this case, I called the table grid_mza_vals.

Now, for my first CTE. This query involves a few user-selected parameters, and I want an easy way to adjust these settings as I experiment to get the best results. I’ll want to fiddle with the number of grid squares in the overall result, as well as the coefficients used later on to calculate the height of each polygon. So my first CTE is called simply params, and returns a single row, composed of these parameters.

CREATE TABLE grid_mza_vals AS
WITH params AS (
    SELECT
        25 AS numsq,
        3800 AS alt_bias,
        500 AS alt_percfactor
),

The numsq value represents the number of grid squares along one edge of my overall grid; we’ll discuss the other values later. I’ve chosen a relatively small number of total grid squares for faster processing while building the rest of the query. I can make it more detailed, if I want, after everything else works.

The next thing I want is a sequence of numbers from 1 to numsq:

range AS (
    SELECT GENERATE_SERIES(0, numsq - 1) AS rng
    FROM params
),

Now I can join the range CTE with itself, to get the coordinates for each grid square:

gridix AS (
    SELECT
        x.rng AS x_ix,
        y.rng AS y_ix
    FROM range x, range y
),

Occasionally I like to check my progress, running whatever bits of the query I’ve already written to review its behavior. CTEs make this convenient, because I can adjust the final clause of the query to select data from whichever of the CTEs I’m currently interested in. Here’s the query thus far:

inegi=# WITH params AS (
    SELECT
        25 AS numsq,
        3800 AS alt_bias,
        500 AS alt_percfactor
),
range AS (
    SELECT GENERATE_SERIES(0, numsq - 1) AS rng
    FROM params
),
gridix AS (
    SELECT
        x.rng AS x_ix,
        y.rng AS y_ix
    FROM range x, range y
)
SELECT * FROM gridix;

 x_ix | y_ix
------+------
    0 |    0
    0 |    1
    0 |    2
    0 |    3
    0 |    4
    0 |    5
    0 |    6
    0 |    7
    0 |    8
    0 |    9
    0 |   10
    0 |   11
    0 |   12
    0 |   13
...

So far, so good. The gridix CTE returns coordinates for each cell in the grid, from zero to the numsq value from my params CTE. From those coordinates, if I know the geographic boundaries of the data set and the number of squares in each edge, I can calculate the latitude and longitude of the four corners of each grid square. First, I need to find the geographic boundaries of the data. My dataset lives in a table called manzanas, Spanish for “city block” (and also “apple”). Each row contains one geographic attribute containing a polygon defining the boundaries of the block, and several other attributes such as the education score I mentioned. In PostGIS there are several different ways to find the bounding box I want; here’s the one I used.

limits AS (
    SELECT
        MIN(ST_XMin(geom)) AS xmin,
        MAX(ST_XMax(geom)) AS xmax,
        MIN(ST_YMin(geom)) AS ymin,
        MAX(ST_YMax(geom)) AS ymax
    FROM
        manzanas
),

And, just to check my results:

inegi=#     SELECT
        MIN(ST_XMin(geom)) AS xmin,
        MAX(ST_XMax(geom)) AS xmax,
        MIN(ST_YMin(geom)) AS ymin,
        MAX(ST_YMax(geom)) AS ymax
    FROM
        manzanas;
     xmin      |     xmax     |       ymin       |       ymax
---------------+--------------+------------------+------------------
 -99.349658451 | -98.94668802 | 19.1241898199991 | 19.5863775499992
(1 row)

So the data in question extend from about 99.35 to 98.95 west longitude, and 19.12 to 19.59 north latitude. Now I’ll calculate the boundaries of each grid square, compose a text representation for the square in Well-Known Text format, and convert the text to a PostGIS geometry object. There’s another way I could do this, much simpler and probably much faster, which I hope to detail in a future blog post, but this will do for now.

gridcoords AS (
    SELECT
        (xmax - xmin) / numsq * x_ix + xmin AS x0,
        (xmax - xmin) / numsq * (x_ix + 1) + xmin AS x1,
        (ymax - ymin) / numsq * y_ix + ymin AS y0,
        (ymax - ymin) / numsq * (y_ix + 1) + ymin AS y1,
        x_ix || ',' || y_ix AS grid_ix
    FROM limits, params, gridix
),
gridewkt AS (
    SELECT
        grid_ix,
        'POLYGON((' ||
        x0 || ' ' || y0 || ',' ||
        x1 || ' ' || y0 || ',' ||
        x1 || ' ' || y1 || ',' ||
        x0 || ' ' || y1 || ',' ||
        x0 || ' ' || y0 || '))' AS ewkt
    FROM gridcoords
),
gridgeom AS (
    SELECT
        grid_ix, ewkt,
        ST_GeomFromText(ewkt, 4326) AS geom
    FROM gridewkt
),

And again, I’ll check the result by SELECTing grid_ix, ewkt, and geom, from the first row of the gridgeom CTE.

-[ RECORD 1 ]-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
grid_ix | 0,0
ewkt    | POLYGON((-99.349658451 19.1241898199991,-99.34562874669 19.1241898199991,-99.34562874669 19.1288116972991,-99.349658451 19.1288116972991,-99.349658451 19.1241898199991))
geom    | 0103000020E6100000010000000500000029F4D6CD60D658C06B646FE7CA1F3340D3E508C81ED658C06B646FE7CA1F3340D3E508C81ED658C0EC3DABCDF920334029F4D6CD60D658C0EC3DABCDF920334029F4D6CD60D658C06B646FE7CA1F3340

I won’t claim an ability to translate the geometry object as represented above, but the ewkt value looks correct, so let’s keep going. Next I need to cut the blocks into pieces, corresponding to the parts of each block that belong in a single grid square. So a block lying entirely in one square will return one row in this next query; a block that intersects two squares will return two rows. Each row will include the geometries of the block, the square, and the intersection of the two, the total area of the block, the area of the intersection, and the total educational score for the block.

block_part AS (
    SELECT
        grid_ix,                     -- Grid square coordinates, e.g. 0,0
        m.gid AS manz_gid,           -- Manzana identifier
        ggm.geom,                    -- Grid square geometry
        m.graproes::FLOAT,           -- Educational score
        ST_Intersection(m.geom, ggm.geom)
            AS manz_bg_int,          -- Geometric intersection between grid square and city block
        ST_Area(m.geom)
            AS manz_area,            -- Area of the city block
        ST_Area(ST_Intersection(m.geom, ggm.geom)) / ST_Area(m.geom)
            AS manz_area_perc        -- Area of the intersection
    from
        gridgeom ggm
        JOIN manzanas m
            ON (ST_Intersects(m.geom, ggm.geom))
    WHERE
        m.graproes IS NOT NULL       -- Skip null educational scores
),

Here’s a sample of those results:

-[ RECORD 1 ]--+---------------------
grid_ix        | 15,16
manz_gid       | 61853
geom           | 0103000020E610000...
graproes       | 11
manz_bg_int    | 0103000020E610000...
manz_area      | 3.26175061395103e-06
manz_area_perc | 0.149256808754461
-[ RECORD 2 ]--+---------------------
grid_ix        | 15,17
manz_gid       | 61853
geom           | 0103000020E610000...
graproes       | 11
manz_bg_int    | 0103000020E610000...
manz_area      | 3.26175061395103e-06
manz_area_perc | 0.850743191246843

These results, which I admit to having selected with some care, show a single block, number 61853, which lies across the border between two grid squares. Now we’ll calculate the education score for each block fragment, and then divide the fragments into groups based on the grid square in which they belong, and aggregate the results. I did this in two separate CTEs.

grid_calc AS (
    SELECT
        grid_ix,
        geom,
        graproes * manz_area_perc AS grid_graproes,
        manz_area_perc
    FROM
        block_part
),
grid_accum AS (
    SELECT
        grid_ix,
        geom,
        SUM(grid_graproes) AS sum_graproes
    FROM grid_calc
    GROUP BY grid_ix, geom
),

This latest CTE gives results such as these:

-[ RECORD 1 ]+-----------------
grid_ix      | 12,14
geom         | 0103000020E61...
sum_graproes | 3888.53630440106

Now we’re left with turning these results into a visualization. I’d like to assign each polygon a height, and a color. To make the visualization easier to understand, I’ll divide the results into a handful of classes, and assign a height and color to each class.

Using an online color palette generator, I came up with a sequence of six colors, which progress from white to a green similar to the green bar on the Mexican flag. Another CTE will return these colors as an array, and yet another will assign the grid squares to groups based on their calculated score. Finally, a third will select the proper color from the array using that group value. At this point, readers are probably thinking “enough already; quit dividing everything into ever smaller CTEs”, to which I can only say, “Yeah, you may be right.”

colors AS (
    SELECT ARRAY['ffffff','d4ead7','a9d6af','7ec188','53ad60','299939'] AS colors
),
colorix AS (
    SELECT *,
        NTILE(ARRAY_LENGTH(colors, 1)) OVER (ORDER BY sum_graproes asc) AS edu_colorix
    FROM grid_accum, colors
),
color AS (
    SELECT *,
        colors[edu_colorix] as edu_color
    from colorix
)

The ntile() window function is useful for this kind of thing. It divides the given partition into buckets, and returns the number of the bucket for each row. Here, the partition consists of the whole data set; we sort it by educational score to ensure low-scoring grid squares get low-numbered buckets. Note also that I can change the colors, adding or removing groups, simply by adjusting the colors CTE. This could theoretically prove handy, if I decided I didn’t like the number of levels or the color scheme, but it’s a feature I never used for this visualization.

We’re on the home stretch, at last, and I should clarify how I plan to turn the database objects into KML, usable on a Liquid Galaxy. I used ogr2ogr from the GDAL toolkit. It converts between a number of different GIS data sources, including PostGIS to KML. I need to feed it the geometry I want to draw, as well as styling instructions and, in this case, a custom KML altitudeMode.

Styling is an involved topic; for our purposes it’s enough to say that I’ll tell ogr2ogr to use our selected color both to draw the lines of our polygons, and to fill them in. But moving the grid square’s geometry to an altitude corresponding to its educational score is fairly easy, using PostGIS’s ST_Force3DZ() function to add to the hitherto two-dimensional polygon a zero-valued third dimension, and ST_Translate() to move it above the surface of the earth a ways. So I can probably finish this with one final query:

-- Insert all previous CTEs here
SELECT
    ST_Translate(
        ST_Force3DZ(geom), 0, 0,
        alt_bias + edu_colorix * alt_percfactor
    ) AS edu_geom,
    'BRUSH(fc:#' || edu_color || 'ff);PEN(c:#' || edu_color || 'ff)' AS edu_style,
    'absolute' AS "altitudeMode"
FROM color, params;

You may remember alt_bias and alt_percfactor, the oddly named and thus far unexplained values in my first params CTE. These I used to control how far apart in altitude one group of polygons is from another, and to bias them all far enough above the ground to avoid the problem of them being obscured by terrain features. You may also remember that this query began with the CREATE TABLE grid_mza_vals AS... command, meaning that we’ll store the results of all this processing in a table, so ogr2ogr can get to it. We call ogr2ogr like this:

ogr2ogr \
    -f LIBKML education.kml \
    PG:"dbname=inegi user=josh password=<redacted>" \
    -sql "SELECT grid_ix, edu_geom, edu_style as \"OGR_STYLE\", \"altitudeMode\" FROM grid_mza_vals"

OGR’s LIBKML driver knows an attribute called “OGR_STYLE” is a style string, and one called “altitudeMode” is, predictably, the feature’s altitude mode. So this will create a KML file, containing one placemark for each row in our grid_mza_vals table. Each placemark consists of a colored square, floating in the air above Mexico City. The squares come in six different levels and six different colors, corresponding to our original education data. Something like this:

Here’s one of the placemarks from the KML.

<Placemark id="sql_statement.1">
        <Style>
          <LineStyle>
            <color>ffffffff</color>
            <width>1</width>
          </LineStyle>
          <PolyStyle>
            <color>ffffffff</color>
          </PolyStyle>
        </Style>
        <ExtendedData>
          <SchemaData schemaUrl="#sql_statement.schema">
            <SimpleData name="grid_ix">11,10</SimpleData>
            <SimpleData name="OGR_STYLE">BRUSH(fc:#ffffffff);PEN(c:#ffffffff)</SimpleData>
          </SchemaData>
        </ExtendedData>
        <Polygon>
          <altitudeMode>absolute</altitudeMode>
          <outerBoundaryIs>
            <LinearRing>
              <coordinates>
                -99.17235146136,19.3090649119991,4300
                -99.15623264412,19.3090649119991,4300
                -99.15623264412,19.3275524211991,4300
                -99.17235146136,19.3275524211991,4300
                -99.17235146136,19.3090649119991,4300
              </coordinates>
            </LinearRing>
          </outerBoundaryIs>
        </Polygon>
      </Placemark>

This may be sufficient, but it gets confusing when viewed from a low angle, because it’s hard to tell which square belongs to which part of the map. I prefer having the polygons “extruded” from the ground, as KML terms it. I used a simple Perl script to add the extrude element to each polygon in the KML, resulting in this:

This query works, but leaves a few things to be desired. For instance, doing everything in one query is probably not the best option when lots of processing is involved. Ideally we would calculate the grid geometries once, and save them in a table somewhere, for quicker processing as we build the rest of the query and experiment with visualization options. Second, PostGIS provides an arguably more elegant way of finding grid squares in the first place, a method which also affords other options for interesting visualizations. Stay tuned for a future blog post discussing these issues. Meanwhile, CTEs proved a valuable way to systematize and modularize a very complicated query.

Joshua Otwell: Using pg_dump and pg_dumpall to Backup PostgreSQL


Businesses and services deliver value based on data. Availability, consistent state, and durability are top priorities for keeping customers and end-users satisfied. Lost or inaccessible data could possibly equate to lost customers.

Database backups should be at the forefront of daily operations and tasks.

We should be prepared for the event that our data becomes corrupted or lost.

I'm a firm believer in an old saying I’ve heard: "It's better to have it and not need it than to need it and not have it."

That applies to database backups as well. Let's face it, without them, you basically have nothing. Operating on the notion that nothing can happen to your data is a fallacy.

Most DBMS's provide some means of built-in backup utilities. PostgreSQL has pg_dump and pg_dumpall out of the box.

Both present numerous customization and structuring options. Covering them all individually in one blog post would be next to impossible. Instead, I'll look at the examples I can best apply to my personal development/learning environment.

That being said, this blog post is not targeted at a production environment. More likely, a single workstation/development environment should benefit the most.

What are pg_dump and pg_dumpall?

The documentation describes pg_dump as: “pg_dump is a utility for backing up a PostgreSQL database”

And the pg_dumpall documentation: “pg_dumpall is a utility for writing out (“dumping”) all PostgreSQL databases of a cluster into one script file.”

Backing up a Database and/or Table(s)

To start, I'll create a practice database and some tables to work with using the below SQL:

postgres=# CREATE DATABASE example_backups;
CREATE DATABASE
example_backups=# CREATE TABLE students(id INTEGER,
example_backups(# f_name VARCHAR(20),
example_backups(# l_name VARCHAR(20));
CREATE TABLE
example_backups=# CREATE TABLE classes(id INTEGER,
example_backups(# subject VARCHAR(20));
CREATE TABLE
example_backups=# INSERT INTO students(id, f_name, l_name)
example_backups-# VALUES (1, 'John', 'Thorn'), (2, 'Phil', 'Hampt'),
example_backups-# (3, 'Sue', 'Dean'), (4, 'Johnny', 'Rames');
INSERT 0 4
example_backups=# INSERT INTO classes(id, subject)
example_backups-# VALUES (1, 'Math'), (2, 'Science'),
example_backups-# (3, 'Biology');
INSERT 0 3
example_backups=# \dt;
         List of relations
Schema |   Name | Type  | Owner
--------+----------+-------+----------
public | classes  | table | postgres
public | students | table | postgres
(2 rows)
example_backups=# SELECT * FROM students;
id | f_name | l_name
----+--------+--------
 1 | John   | Thorn
 2 | Phil   | Hampt
 3 | Sue    | Dean
 4 | Johnny | Rames
(4 rows)
example_backups=# SELECT * FROM classes;
id | subject
----+---------
 1 | Math
 2 | Science
 3 | Biology
(3 rows)

Database and tables all set up.

To note:

In many of these examples, I'll take advantage of psql's \! meta-command, allowing you to either drop into a shell (command-line), or execute whatever shell commands that follow.

Just be aware that in a terminal or command-line session (denoted by a leading '$' in this blog post), the \! meta-command should not be included in any of the pg_dump or pg_dumpall commands. Again, it is a convenience meta-command within psql.

Backing up a single table

In this first example, I'll dump the only the students table:

example_backups=# \! pg_dump -U postgres -t students example_backups > ~/Example_Dumps/students.sql

Listing out the directory's contents, we see the file is there:

example_backups=# \! ls -a ~/Example_Dumps
.  .. students.sql

The command-line options for this individual command are:

  • -U postgres: the specified username
  • -t students: the table to dump
  • example_backups: the database

What's in the students.sql file?

$ cat students.sql
--
-- PostgreSQL database dump
--
-- Dumped from database version 10.4 (Ubuntu 10.4-2.pgdg16.04+1)
-- Dumped by pg_dump version 10.4 (Ubuntu 10.4-2.pgdg16.04+1)
SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SELECT pg_catalog.set_config('search_path', '', false);
SET check_function_bodies = false;
SET client_min_messages = warning;
SET row_security = off;
 
SET default_tablespace = '';
 
SET default_with_oids = false;
 
--
-- Name: students; Type: TABLE; Schema: public; Owner: postgres
--
CREATE TABLE public.students (
   id integer,
   f_name character varying(20),
   l_name character varying(20)
);
 
ALTER TABLE public.students OWNER TO postgres;
 
--
-- Data for Name: students; Type: TABLE DATA; Schema: public; Owner: postgres
--
COPY public.students (id, f_name, l_name) FROM stdin;
1 John Thorn
2 Phil Hampt
3 Sue Dean
4 Johnny Rames
\.
--
-- PostgreSQL database dump complete

We can see the file has the necessary SQL commands to re-create and re-populate table students.

But, is the backup good? Reliable and working?

We will test it out and see.

example_backups=# DROP TABLE students;
DROP TABLE
example_backups=# \dt;
         List of relations
Schema |  Name | Type  | Owner
--------+---------+-------+----------
public | classes | table | postgres
(1 row)

It's gone.

Then from the command-line pass the saved backup into psql:

$ psql -U postgres -W -d example_backups -f ~/Example_Dumps/students.sql
Password for user postgres:
SET
SET
SET
SET
SET
set_config
------------
(1 row)
 
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
COPY 4

Let's verify in the database:

example_backups=# \dt;
         List of relations
Schema |   Name | Type  | Owner
--------+----------+-------+----------
public | classes  | table | postgres
public | students | table | postgres
(2 rows)
example_backups=# SELECT * FROM students;
id | f_name | l_name
----+--------+--------
 1 | John   | Thorn
 2 | Phil   | Hampt
 3 | Sue    | Dean
 4 | Johnny | Rames
(4 rows)

Table and data have been restored.

Backing up multiple tables

In this next example, we will back up both tables using this command:

example_backups=# \! pg_dump -U postgres -W -t classes -t students -d example_backups > ~/Example_Dumps/all_tables.sql
Password:

(Notice I needed to specify a password in this command due to the -W option, where I did not in the first example. More on this to come.)

Let's again verify the file was created by listing out the directory contents:

example_backups=# \! ls -a ~/Example_Dumps
.  .. all_tables.sql  students.sql

Then drop the tables:

example_backups=# DROP TABLE classes;
DROP TABLE
example_backups=# DROP TABLE students;
DROP TABLE
example_backups=# \dt;
Did not find any relations.

Then restore with the all_tables.sql backup file:

$ psql -U postgres -W -d example_backups -f ~/Example_Dumps/all_tables.sql
Password for user postgres:
SET
SET
SET
SET
SET
set_config
------------
(1 row)
 
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
COPY 3
COPY 4
example_backups=# \dt;
         List of relations
Schema |   Name | Type  | Owner
--------+----------+-------+----------
public | classes  | table | postgres
public | students | table | postgres
(2 rows)

Both tables have been restored.

As we can see with pg_dump, you can back up just one, or multiple tables within a specific database.

Backing up a database

Let's now see how to backup the entire example_backups database with pg_dump.

example_backups=# \! pg_dump -U postgres -W -d example_backups > ~/Example_Dumps/ex_back_db.sql
Password:
 
example_backups=# \! ls -a ~/Example_Dumps
.  .. all_tables.sql  ex_back_db.sql students.sql

The ex_back_db.sql file is there.

I'll connect to the postgres database in order to drop the example_backups database.

postgres=# DROP DATABASE example_backups;
DROP DATABASE

Then restore from the command-line:

$ psql -U postgres -W -d example_backups -f ~/Example_Dumps/ex_back_db.sql
Password for user postgres:
psql: FATAL:  database "example_backups" does not exist

It's not there. Why not? And where is it?

We have to create it first.

postgres=# CREATE DATABASE example_backups;
CREATE DATABASE

Then restore with the same command:

$ psql -U postgres -W -d example_backups -f ~/Example_Dumps/ex_back_db.sql
Password for user postgres:
SET
SET
SET
SET
SET
set_config
------------
(1 row)
 
SET
SET
SET
CREATE EXTENSION
COMMENT
SET
SET
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
COPY 3
COPY 4
postgres=# \c example_backups;
You are now connected to database "example_backups" as user "postgres".
example_backups=# \dt;
         List of relations
Schema |   Name | Type  | Owner
--------+----------+-------+----------
public | classes  | table | postgres
public | students | table | postgres
(2 rows)

Database and all tables present and accounted for.

We can avoid this scenario of having to create the target database first, by including the -C option when taking the backup.

example_backups=# \! pg_dump -U postgres -W -C -d example_backups > ~/Example_Dumps/ex_back2_db.sql
Password:

I'll reconnect to the postgres database and drop the example_backups database, so we can see how the restore works now (note: the connect and DROP commands are not shown, for brevity).

Then on the command-line (notice no -d dbname option included):

$ psql -U postgres -W -f ~/Example_Dumps/ex_back2_db.sql
Password for user postgres:
……………..
(And partway through the output...)
CREATE DATABASE
ALTER DATABASE
Password for user postgres:
You are now connected to database "example_backups" as user "postgres".
SET
SET
SET
SET
SET
set_config
------------
(1 row)
 
SET
SET
SET
CREATE EXTENSION
COMMENT
SET
SET
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
COPY 3
COPY 4

Using the -C option, we are prompted for a password to make a connection as mentioned in the documentation concerning the -C flag:

“Begin the output with a command to create the database itself and reconnect to the created database.”

Then in the psql session:

postgres=# \c example_backups;
You are now connected to database "example_backups" as user "postgres".

Everything is restored, good to go, and without the need to create the target database prior to the restore.

pg_dumpall for the entire cluster

So far, we have backed up a single table, multiple tables, and a single database.

But if we want more than that, for instance backing up the entire PostgreSQL cluster, that's where we need to use pg_dumpall.

So what are some notable differences between pg_dump and pg_dumpall?

For starters, here is an important distinction from the documentation:

“Since pg_dumpall reads tables from all databases, you will most likely have to connect as a database superuser in order to produce a complete dump. Also, you will need superuser privileges to execute the saved script in order to be allowed to add users and groups and to create databases.”

Using the below command, I'll back up my entire PostgreSQL cluster and save it in the entire_cluster.sql file:

$ pg_dumpall -U postgres -W -f ~/Example_Dumps/Cluster_Dumps/entire_cluster.sql
Password:
Password:
Password:
Password:
Password:
Password:
Password:
Password:
Password:
Password:
Password:
Password:
Password:
Password:
Password:
Password:
Password:
Password:
Password:
Password:
Password:
Password:
Password:
Password:

What on earth? Are you wondering whether I had to enter a password at each prompt?

Yep, sure did. 24 times.

Count 'em. (Hey, I like to explore and delve into different databases as I learn. What can I say?)

But why all the prompts?

First of all, after all that hard work, did pg_dumpall create the backup file?

postgres=# \! ls -a ~/Example_Dumps/Cluster_Dumps
.  .. entire_cluster.sql

Yep, the backup file is there.

Let's shed some light on all that 'typing practice' by looking at this passage from the documentation:

“pg_dumpall needs to connect several times to the PostgreSQL server (once per database). If you use password authentication it will ask for a password each time.”

I know what you're thinking.

This may not be ideal or even feasible. What about processes, scripts, or cron jobs that run in the middle of the night?

Is someone going to hover over the keyboard, waiting to type?

Probably not.

One effective measure to prevent facing those repeated password prompts is a ~/.pgpass file.

Here is the syntax the ~/.pgpass file requires (example provided from the documentation; see the link above):

hostname:port:database:username:password
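
For illustration only (the host, port, and password below are placeholders, not values from this environment), an entry covering every database for the postgres role could look like this:

localhost:5432:*:postgres:your_password_here

Note that libpq ignores the file unless its permissions block access by group and others, so set them with chmod 600 ~/.pgpass.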

With a ~/.pgpass file present in my development environment, containing the necessary credentials for the postgres role, I can drop the -W option (or even pass -w to never prompt) and run pg_dumpall without manually entering the password:

$ pg_dumpall -U postgres -f ~/Example_Dumps/Cluster_Dumps/entire_cluster2nd.sql

Listing out the directory contents:

postgres=# \! ls -a ~/Example_Dumps/Cluster_Dumps
.  .. entire_cluster2nd.sql  entire_cluster.sql

The file is created and no repeating password prompts.

The saved file can be reloaded with psql, just as with the pg_dump backups.

Which database you connect to for the restore matters less here as well, according to this passage from the documentation: “It is not important to which database you connect here since the script file created by pg_dumpall will contain the appropriate commands to create and connect to the saved databases.”
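
For example, a restore of the cluster dump taken earlier could be run like this (connecting to the postgres database here, though per the quote any database would do):

$ psql -U postgres -f ~/Example_Dumps/Cluster_Dumps/entire_cluster.sql -d postgres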


pg_dump, pg_dumpall, and shell scripts - A handy combination

In this section, we will see a couple of examples of incorporating pg_dump and pg_dumpall into simple shell scripts.

To be clear, this is not a shell script tutorial, nor am I a shell script guru. I'll mainly provide a couple of examples I use in my local development/learning environment.

Up first, let's look at a simple shell script you can use to backup a single database:

#!/bin/bash
# This script performs a pg_dump, saving the dump file to the specified directory.
# The first arg ($1) is the database user to connect with.
# The second arg ($2) is the database to back up and is included in the file name.
# $(date +"%Y_%m_%d") includes the current system date in the file name.

pg_dump -U "$1" -W -C -d "$2" > ~/PG_dumps/Dump_Scripts/"$(date +"%Y_%m_%d")_$2".sql

As you can see, this script accepts 2 arguments: the first one is the user (or role) to connect with for the backup, while the second is the name of the database you want to back up.

Notice the -C option in the command so that we can restore if the database happens to be non-existent, without the need to manually create it beforehand.

Let's call the script with the postgres role for the example_backups database (don't forget to make the script executable, with at least chmod +x, before calling it for the first time):

$ ~/My_Scripts/pgd.sh postgres example_backups
Password:

And verify it's there:

$ ls -a ~/PG_dumps/Dump_Scripts/
.  .. 2018_06_06_example_backups.sql

Restoration is performed with this backup script as in the previous examples.
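
For instance, since the script passed -C to pg_dump, the dated file created above can be restored with the same psql pattern as the earlier -C example, with no need to create the database or supply -d (the file name is the one from my run; yours will carry your own date):

$ psql -U postgres -W -f ~/PG_dumps/Dump_Scripts/2018_06_06_example_backups.sql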

A similar shell script can be used with pg_dumpall for backing up the entire PostgreSQL cluster.

This shell script will pipe (|) pg_dumpall into gzip, which is then directed to a designated file location:

#!/bin/bash
# This shell script calls pg_dumpall and pipes into the gzip utility, then directs to
# a directory for storage.
# $(date +"%Y_%m_%d") incorporates the current system date into the file name.
 
pg_dumpall -U postgres | gzip > ~/PG_dumps/Cluster_Dumps/$(date +"%Y_%m_%d")_pg_bck.gz

Unlike the previous example script, this one does not accept any arguments.

I'll call this script on the command-line (no password prompt, since the postgres role utilizes the ~/.pgpass file - see the section above):

$ ~/My_Scripts/pgalldmp.sh

Once complete, I'll list the directory contents, showing file sizes for comparison between the .sql and .gz files:

postgres=# \! ls -sh ~/PG_dumps/Cluster_Dumps
total 957M
37M 2018_05_22_pg_bck.gz   32M 2018_06_06_pg_bck.gz 445M entire_cluster2nd.sql  445M entire_cluster.sql

A note on formats: the documentation's statement that “The alternative archive file formats must be used with pg_restore to rebuild the database.” refers to pg_dump's custom, directory, and tar output formats. A gzip-compressed plain-text dump like the one above does not need pg_restore; it only has to be decompressed and fed back to psql.
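
A minimal sketch of that restore (file name taken from the listing above; the connecting database is an assumption):

$ gunzip -c ~/PG_dumps/Cluster_Dumps/2018_06_06_pg_bck.gz | psql -U postgres -d postgres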

Summary

I have assembled key points from the documentation on pg_dump and pg_dumpall, along with my observations, to close out this blog post:

Note: Points provided from the documentation are in quotes.

  • “pg_dump only dumps a single database”
  • The plain-text SQL file format is the default output for pg_dump.
  • A role needs the SELECT privilege to run pg_dump according to this line in the documentation: “pg_dump internally executes SELECT statements. If you have problems running pg_dump, make sure you are able to select information from the database using, for example, psql”
  • To include the necessary CREATE DATABASE command and a reconnect in the backup file, include the -C option.
  • -W: This option forces pg_dump to prompt for a password. This flag is not necessary since if the server requires a password, you are prompted anyway. Nevertheless, this passage in the documentation caught my eye so I thought to include it here: “However, pg_dump will waste a connection attempt finding out that the server wants a password. In some cases it is worth typing -W to avoid the extra connection attempt.”
  • -d: Specifies the database to connect to. Also in the documentation: ”This is equivalent to specifying dbname as the first non-option argument on the command line.”
  • Utilizing flags such as -t (table) allows users to back up portions of the database, namely the tables they have access privileges for.
  • Backup file formats can vary, but plain-text .sql files are a solid choice; they are read back in with psql for a restore.
  • pg_dump can back up a running, active database without interfering with other operations (i.e., other readers and writers).
  • One caveat: pg_dump does not dump roles or other cluster-wide objects such as tablespaces; it dumps only a single database.
  • To take backups of your entire PostgreSQL cluster, pg_dumpall is the better choice.
  • pg_dumpall can handle the entire cluster, backing up information on roles, tablespaces, users, permissions, etc., where pg_dump cannot.
  • Chances are, a role with SUPERUSER privileges will have to perform the dump, since reading all tables in all databases is required, and will likely be needed for the restore as well, because recreating roles and databases when the file is read back in through psql requires elevated privileges.

My hope is that through this blog post I have provided adequate examples and details for a beginner-level overview of pg_dump and pg_dumpall in a single development/learning PostgreSQL environment.

Although all available options were not explored, the official documentation contains a wealth of information with examples for both utilities so be sure and consult that resource for further study, questions, and reading.

Federico Campoli: Modern times

The ansible is a fictional device made popular by the book Ender’s Game, written by Orson Scott Card. The ansible allows instant communication between two points in space, regardless of the distance.

It is no surprise that Red Hat’s Ansible shares the same name, as it gives a simple and efficient way to manage servers and automate tasks while remaining very lightweight, with minimal requirements on the target machines.

Robert Haas: Using force_parallel_mode Correctly

I admit it: I invented force_parallel_mode.  I believed then, and still believe now, that it is valuable for testing purposes.  Certainly, testing using force_parallel_mode=on or force_parallel_mode=regress has uncovered many bugs in PostgreSQL's parallel query support that would otherwise have been very difficult to find.  At the same time, it's pretty clear that this setting has caused enormous confusion, even among PostgreSQL experts.  In fact, in my experience, almost everyone who sets force_parallel_mode is doing so for the wrong reasons.
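
As a rough sketch of the testing-oriented usage (session-level only; exact behavior depends on your PostgreSQL version, and this example is not drawn from the post itself):

postgres=# SET force_parallel_mode = regress;
SET

With the setting at on, the planner adds a Gather node on top of any plan for which that appears safe, even when it would not normally choose parallelism; regress does the same but additionally keeps EXPLAIN output comparable to non-parallel plans, which is what makes it useful for the regression tests.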

Andrew Dunstan: Keeping our perl code clean

Recently I have been refining and adding utilities to look after our Perl code. You might be surprised to learn that, as well as 1.3 million or so lines of C code, there are about 30,000 lines of Perl code in our sources. This is a sizeable body of code, even if it’s dwarfed by our C code. What does it do? Well, lots of things. It runs some very critical steps in building from source; for example, the code to set up our catalogs is created by Perl. All the new data setup for catalogs is in fact Perl code. That’s another 20,000 lines or so of code on top of the 30,000 mentioned above. We also use Perl to run TAP tests, such as testing initdb and pg_dump. And it drives building and testing when we’re building with the Microsoft tool-sets on Windows.

So, what changes have been made? First, we’ve slightly refined the setup for pgperltidy, our utility for formatting Perl code. This utility, based on the well-known perltidy, does for Perl code what pgindent does for C code.

Second, we’ve added a script and a profile to run a utility called perlcritic. This utility checks whether Perl code complies with a set of “best practices”. Currently we’re only testing for the “worst” practices, but I hope in future to be able to check in a much stricter way. The infrastructure is now there to support it.
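
For a rough idea of what such a check looks like (the file name below is hypothetical, and this is not necessarily the exact profile used in the PostgreSQL tree), perlcritic at its gentlest severity flags only the most serious problems:

$ perlcritic --gentle some_script.pl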

Finally, there have been some code adjustments to allow us to check all the Perl files for compile-time errors and warnings, and a utility to run those checks.
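
A check of that kind boils down to Perl’s own compile-only mode; a sketch (the file name is hypothetical, and this is not necessarily the exact wrapper added to the tree):

$ perl -cw some_script.pl
some_script.pl syntax OK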

These changes mirror some changes I have made in the buildfarm client and server code.

It’s easy to forget about these things, so I’ve also added a Buildfarm module to run the checks on our Perl code. If it finds any policy violations or compile-time errors or warnings, we’ll soon know about it.
