
Laurenz Albe: JSON in PostgreSQL: how to use it right


JSON, the latest step in database technology
© Laurenz Albe 2021

The comprehensive JSON support in PostgreSQL is one of its best-loved features. Many people – particularly those with a stronger background in Javascript programming than in relational databases – use it extensively. However, my experience is that the vast majority of people don’t use it correctly. That causes problems and unhappiness in the long run.

In this article, I will try to point out good and bad uses of JSON in PostgreSQL, and provide you with guidelines that you can follow.

A bad example

This data model exemplifies everything that you can do wrong:

/* this table is fine */
CREATE TABLE people (
   id bigint PRIMARY KEY,
   name text NOT NULL
);

INSERT INTO people VALUES
   (1, 'laurenz'),
   (2, 'julian'),
   (3, 'ants'),
   (4, 'kaarel');

/* this table is ill-defined */
CREATE TABLE rooms (
   id bigint PRIMARY KEY,
   data jsonb
);

INSERT INTO rooms VALUES
   (1,
    '{ "name": "Room 1",
       "reservations": [
           { "who": 1, "from": "2021-06-01 09:00:00", "to": "2021-06-01 10:00:00" },
           { "who": 3, "from": "2021-06-01 10:00:00", "to": "2021-06-01 11:30:00" },
           { "who": 2, "from": "2021-06-01 13:15:00", "to": "2021-06-01 14:45:00" },
           { "who": 1, "from": "2021-06-01 15:00:00", "to": "2021-06-01 16:00:00" }
        ]
     }'),
   (2,
    '{ "name": "Room 2",
       "reservations": [
           { "who": 2, "from": "2021-06-01 09:30:00", "to": "2021-06-01 10:30:00" }
        ]
     }');

First mistake: model regular columns as JSON attributes

There is no reason not to have the room name as a regular column. After all, every room will have a name, and we may want to enforce constraints like uniqueness on the room name.
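
A minimal sketch of how the table could look with the name pulled out into a regular column (the UNIQUE constraint is an assumption about the business rule, not part of the original example):

/* better: the room name as a regular column */
CREATE TABLE rooms (
   id   bigint PRIMARY KEY,
   name text   UNIQUE NOT NULL,
   data jsonb
);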

Second mistake: model tabular data as JSON array

The room reservations are perfectly regular tabular data that define a many-to-many relationship between the rooms and the people. It would have been simple to model the same data with a junction table:

CREATE TABLE reservations (
   people_id bigint REFERENCES people NOT NULL,
   room_id   bigint REFERENCES rooms  NOT NULL,
   reserved  tsrange                  NOT NULL,
   PRIMARY KEY (room_id, people_id)
);

INSERT INTO reservations VALUES
   (1, 1, '[2021-06-01 09:00:00,2021-06-01 10:00:00)'),
   (3, 1, '[2021-06-01 10:00:00,2021-06-01 11:30:00)'),
   (2, 1, '[2021-06-01 13:15:00,2021-06-01 14:45:00)'),
   (1, 1, '[2021-06-01 15:00:00,2021-06-01 16:00:00)'),
   (2, 2, '[2021-06-01 09:30:00,2021-06-01 10:30:00)');

Many people seem to think that storing few large rows in a table is more efficient than storing many small rows. There is some truth to that, since every row has some overhead, and PostgreSQL compresses large data. But if you want to retrieve only parts of the data, or want to modify them, many small rows are much more efficient – as we will see below.

Third mistake: store foreign keys in JSON

The "who" attribute stores a foreign key reference to people. That is not a good idea, because it is impossible for the database to enforce such a constraint: I could just as well have inserted a reference to a non-existing person. With the junction table from above, defining a foreign key is trivial.

Moreover, you often want to join on foreign keys. With JSON, that would require a cross join with the unnested JSON array:

SELECT rooms.data ->> 'name',
       people.name
FROM rooms
   CROSS JOIN LATERAL jsonb_array_elements(
                         rooms.data -> 'reservations'
                      ) AS res(j)
   JOIN people
      ON res.j ->> 'who' = people.id::text;

With the junction table, that would be

SELECT rooms.name,
       people.name
FROM rooms
   JOIN reservations AS r ON r.room_id = rooms.id
   JOIN people ON r.people_id = people.id;

You can probably guess which of these two queries will be more efficient.

Fourth mistake: modify JSON data

If you want to add a new reservation, you have to execute a statement like

UPDATE rooms
SET data = jsonb_set(
              data,
              '{reservations,100000}',
              '{"who": 3, "from": "2021-06-01 11:00:00", "to": "2021-06-01 12:00:00"}',
              TRUE
           )
WHERE id = 2;

This will fetch the complete JSON object, construct a new JSON from it and store that new object in the table. The whole JSON object has to be read and written, which is more I/O than you would want – particularly if the JSON object is large and stored out of line.

Compare how simple the same exercise would be with the junction table:

INSERT INTO reservations VALUES
   (3, 2, '[2021-06-01 11:00:00,2021-06-01 12:00:00)');

This statement will only write a small amount of data.

Deleting a reservation is just as complicated and expensive, and is left as an exercise to the reader.
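
For comparison, here is one way it could look (a sketch: the JSON variant rebuilds the whole "reservations" array without the element in question, while the junction table only needs a plain DELETE):

/* JSON: rebuild the array without the 09:30 reservation in room 2 */
UPDATE rooms
SET data = jsonb_set(
              data,
              '{reservations}',
              (SELECT coalesce(jsonb_agg(res.j), '[]')
               FROM jsonb_array_elements(data -> 'reservations') AS res(j)
               WHERE res.j ->> 'from' <> '2021-06-01 09:30:00')
           )
WHERE id = 2;

/* junction table: a plain DELETE */
DELETE FROM reservations
WHERE room_id = 2
  AND reserved = '[2021-06-01 09:30:00,2021-06-01 10:30:00)';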

Fifth mistake: trying to enforce constraints on JSON

So far, our data model offers no protection against overlapping reservations, which would be good to enforce in the database.

With JSON, we are pretty much out of luck here. The best that comes to mind is a constraint trigger, but that would require elaborate locking or the SERIALIZABLE transaction isolation level to be free from race conditions. Also, the code would be far from simple.

With the junction table, the exercise is simple; all we have to do is to add an exclusion constraint that checks for overlaps with the && operator:

CREATE EXTENSION IF NOT EXISTS btree_gist;

ALTER TABLE reservations ADD EXCLUDE USING gist (
   reserved WITH &&,
   room_id WITH =
);

The extension is required to create a GiST index on a bigint column.

Sixth mistake: complicated searches in JSON

Simple searches for equality can be performed with the JSON containment operator @>, and such searches can be supported by a GIN index. But complicated searches are a pain.
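
For completeness, such an equality search could look roughly like this (a sketch; the GIN index on data is an addition, not part of the example schema above):

CREATE INDEX ON rooms USING gin (data);

/* find rooms with at least one reservation by person 2 */
SELECT id
FROM rooms
WHERE data @> '{"reservations": [{"who": 2}]}';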

Imagine we want to search for all rooms that are occupied at 2021-06-01 15:30:00. With JSON, that would look somewhat like

SELECT id
FROM rooms
   CROSS JOIN LATERAL jsonb_array_elements(
                         rooms.data -> 'reservations'
                      ) AS elem(j)
WHERE CAST(elem.j ->> 'from' AS timestamp) <= TIMESTAMP '2021-06-01 15:30:00'
  AND CAST(elem.j ->> 'to'   AS timestamp) >  TIMESTAMP '2021-06-01 15:30:00';

With our junction table, the query becomes

SELECT room_id
FROM reservations
WHERE reserved @> TIMESTAMP '2021-06-01 15:30:00';

That query can use the GiST index from the exclusion constraint we created above.

If all the above is wrong, should we use JSON in PostgreSQL at all?

Don’t get me wrong: JSON support in PostgreSQL is a wonderful thing. It is just that many people don’t understand how to use it right. For example, the majority of questions about PostgreSQL and JSON asked on Stackoverflow are about problems that arise from the use of JSON where it had better been avoided.

Follow these guidelines when you consider using JSON in PostgreSQL:

  • Don’t use JSON for data that can easily be stored in database tables.
  • Avoid large JSON objects if you want to modify individual attributes.
  • Don’t use JSON if you want to use attributes in complicated WHERE conditions.
  • Avoid JSON if you want to join on some of the attributes.
  • Don’t use JSON if you want constraints on the data.
  • Don’t store references to other tables in JSON attributes.

Often it may be a good idea to store some attributes as regular table columns and others in a JSON. The less you need to process the data inside the database, the better it is to store them as JSON.

A good example

To show an example of how JSON in the database can be used with benefit, let us consider a shop that sells all kinds of mixed goods. There are some properties that all or most of the goods will have, like price, weight, manufacturer, number available or package size. Other attributes may be rare, like the type of power plug, or ambiguous, like the pitch in screws or tuning forks.

Rather than defining a table with hundreds of columns for all the possible attributes, most of which will be NULL, we model the most frequent attributes with normal table columns and use a JSON for the rest:

CREATE TABLE wares (
   id           bigint           PRIMARY KEY,
   name         text             UNIQUE NOT NULL,
   price        numeric(15,2)    NOT NULL,
   weight_kg    double precision NOT NULL,
   available    integer          NOT NULL
      CHECK (available >= 0),
   manufacturer text,
   package_size integer          NOT NULL
      CHECK (package_size >= 0),
   attributes   jsonb
);

/* to search in "attributes" */
CREATE INDEX ON wares USING gin (attributes);

/* for similarity search on "name" */
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX ON wares USING gin (name gin_trgm_ops);

This will allow efficient queries like

SELECT name, price, available
FROM wares
WHERE name LIKE '%tuning fork%'
  AND attributes @> '{"pitch": "a"}';

Conclusion

There is a lot you can do wrong when using JSON in PostgreSQL, especially if you are not acquainted with relational databases. However, it can be a powerful tool – when used properly.

The post JSON in PostgreSQL: how to use it right appeared first on CYBERTEC.


Michael Christofides: Why isn’t Postgres using my index?


Cover photo: Emily Morter

If you spend any amount of time working with Postgres, it’s likely that at some point you’re going to wonder why it isn’t using an index that you think it should be.

Unlike some databases, you can’t force PostgreSQL to use a specific index, but there are several things you can do to work out what’s going on.

If you’re somewhat new to Postgres, I’m hoping going through these will be interesting, as well as being useful next time you see an issue like this.

I’ll be assuming a basic familiarity with query plans (using EXPLAIN) and Postgres indexing, but honestly not much.

The two main reasons

There are two main categories of reason that result in Postgres not using an index. Either it can’t use the index, or it doesn’t think using the index will be faster.

Working out which of these categories you’re seeing is a great starting point.

Sadly, we can’t tell from a single query plan which case we’re seeing, but through some investigating via multiple query plans, we can work out what’s going on.

Since the reasons that Postgres can’t use an index are more straightforward, and it can be easy to rule out quickly too, I prefer to check that first.

Checking whether Postgres can use the index

There are a few reasons Postgres won’t use a seemingly perfect index for a query.

For example:

  • A function is preventing its use 

  • A data type mismatch is preventing its use

  • The index does not exist (oops)

To start with, you can try to disprove you’re in this category of issue by getting a query plan that does use the index.

As I alluded to earlier, some other databases have a “query hints” feature that allows you to (pretty much) force the use of a specific index. But since PostgreSQL does not have such a feature, we can steer clear of that holy war and look at what we can do instead.

For example, Postgres has several parameters that let us temporarily discourage the planner’s choice of certain operations. If you want to encourage an index scan in place of a sequential scan, you can try:

set enable_seqscan = off;

This does not, as the name might suggest, completely disable sequential scans, but rather discourages them by making the planner think they’d be a LOT more expensive than they really would be.

As such, if after setting this to off you still get a Seq Scan, there’s a really good chance that Postgres can’t use the index you want, and you’re seeing this category of issue. If, however, it does now use the index you’d hoped for, you can skip to the other reasons.

Similarly, if Postgres is picking a different index instead of the index you think would be optimal, then this trick to make indexes invisible shared by Haki Benita (to drop an index and then rollback, all inside a transaction) is a nice way to test this, in a non-production environment of course. Again, whether the query plan changes or not gives you a very good signal as to which category of problem you have.
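
A minimal sketch of that trick (the index name is a placeholder; the transaction brings the index back on rollback, but the DROP does take a lock on the table, so only do this where that is acceptable):

begin;
drop index my_index;
explain select * from t where email = 'michael@pgmustard.com';
rollback;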

If Postgres can’t use the index

If you think you’re in the boat of Postgres not being able to use the index, it is first worth checking that the index you’re expecting Postgres to use actually exists (in the environment you’re testing in). If it doesn’t, please believe me that you are not the first, and won’t be the last, to be in this situation! And at least you now have a simple solution.

Assuming the index does exist, though, the next step is to check its definition. This will be helpful for the next two checks we’ll do.

Functions

If your query has a function on the data involved, your index will likely need to match it to get used. For example, a query like:

select * from t where lower(email) = 'michael@pgmustard.com';

Won’t use a simple index on “email”, even if in theory it could.

To quickly test if this is the issue you’re seeing, you can request the query plan for the query without the function, for example:

explain select * from t where email = 'michael@pgmustard.com';

If you see it is now using your index, and this is how you commonly query that column, you may wish to add a functional index. For example, in this case:

create index i on t (lower(email));

Similarly, you might see this when doing arithmetic on a column. For example, the following query would require an index on “(column + 1)”:

select * from t where column + 1 = 10;

Whereas the equivalent query below would be able to use an index on “column”:

select * from t where column = 10 - 1;

In cases like this, if you are able to change the query, that is normally the better solution.

Datatype mismatches

Earlier, I also mentioned datatype mismatches. You might spot these in the query plan through datatype casting (the :: characters). Postgres can handle some of these without issue, eg varchar to text, but some casts will result in an index not being used.

Here is a contrived example:

explain select * from t where id = 100::numeric;

The explicit cast prevents use of my index, resulting in this query plan:

 Seq Scan on t  (cost=0.00..2041.00 rows=500 width=10)
   Filter: ((id)::numeric = '100'::numeric)

To test whether this is the issue, you can try explicitly casting to the column’s datatype in your query, or avoiding the cast being added by the application (eg via an ORM).
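
For example, assuming the id column is of type bigint, casting the literal to the column’s type (or simply leaving it untyped) should let the index be considered again (a sketch):

explain select * from t where id = 100::bigint;

If the plan now switches to an Index Scan, the numeric cast was the culprit.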

If Postgres can use the index, but doesn’t think it will be faster

If you’ve determined that Postgres can use the index, but is choosing not to, then this next section is for you.

There are usually several ways Postgres can get the results of a given query. As such, before executing a query, the planner will estimate the “cost” of different options, and pick the plan with the lowest cost in the hope that it will be the fastest way of executing that query. It’s important to note that these costs are in an arbitrary unit, where higher numbers are a predictor of higher execution times, so it’s the relative numbers that matter, not the absolutes. 

As such, if Postgres chooses not to use your index, it is likely that it is calculating the cost of that query plan to be higher than the one it has chosen.

We can verify this by using the tricks mentioned in the previous section, like enable_seqscan = false, or hiding other indexes, and comparing the estimated costs of the different query plans.

The next step is to check whether the execution time of the plan that uses the index is actually faster, or not. You may wish to run it a few times to warm up the cache. If it isn’t faster, you are looking at a case where the planner has correctly chosen not to use your index.

At this point, it is worth noting that the Postgres planner will try to optimize for execution time, rather than efficiency (eg blocks of data read). Although these usually go hand in hand, it is a subtle distinction worth bearing in mind.

There are a few (normally good) reasons for Postgres choosing a sequential scan even when it could use an index scan:

  1. If the table is small

  2. If a large proportion of the rows are being returned

  3. If there is a LIMIT clause and it thinks it can abort early

If none of these are the case, you may wish to skip ahead to the cost estimation section.

If the table is small

If a table is small (very roughly 100 rows or fewer), Postgres may estimate that it will be faster to read the table sequentially and filter out rows as needed, even for highly selective queries.

For example, in pgMustard we show scans on small tables as an anti-tip (scoring 0.0 out of 5.0 for index potential):

[Image: small-table.png, a pgMustard screenshot showing the "index potential" anti-tip for a scan on a small table]

If this is a test environment, and you have a lot more data in production, you may need to consider testing on a more realistic dataset.

On the other hand, if this is a realistic dataset already, and it would in fact have been significantly faster to use an index, you may wish to look into the cost constants section below.

If a significant proportion of the rows are being returned

Another case where Postgres will choose to do a sequential scan over an index scan is when a significant proportion of the table is being returned. The exact proportion of what will count as significant will vary depending on the data and cost parameters, but here is an example (with default configuration) of it doing so to return 30% of the data:

[Image: significant-proportion.png, a query plan choosing a sequential scan to return 30% of the rows]

Similarly to the small table case, in pgMustard, you will see cases like this as an “index potential” anti-tip, with details about the proportion of rows being returned.

Again, if you work out that (for your query and hardware) it would have been faster to use an index, a good next step is to look into the configuration parameters section below.

Limit clauses

The third case I mentioned above can happen when you have a low LIMIT clause (relative to the total number of rows), and the planner believes it is likely to find enough qualifying rows quickly, allowing it to abort early.

This optimization is a double-edged sword, leading to very fast queries when the planner is correct, and very slow queries when it is wrong. The guide to using EXPLAIN in the PostgreSQL documentation includes a nice LIMIT example, and Christophe Pettus, CEO at PostgreSQL Experts Inc, also succinctly described the potential dangers of using LIMIT in a recent Twitter thread. 

As well as avoiding the use of LIMIT when you don’t need it, you can also help the query planner estimate row counts more accurately. Luckily, we’re just about to go into how to do that.

Cost estimation

Naturally, a lot of the time when the query planner chooses a suboptimal query plan, it boils down to inaccurate cost estimation.

There are a few ways Postgres can end up badly estimating the overall cost, but the biggest two categories of issue are:

  1. Its row count estimates are way off

  2. Its cost constants are suboptimal

In my experience, bad row count estimates have been the more common of the two, so let’s start there.

Bad row count estimates

The query planner estimates the number of rows each operation will return (mostly) using statistics it has gathered. It keeps these statistics up to date via ANALYZE, not to be confused with the EXPLAIN parameter of the same name.

These estimates can end up way off if the data changes dramatically before it gathers new statistics, or if any of its assumptions are bad for a different reason. Naturally, these bad estimates can affect things that are very important for performance; like join order, join strategy, and of course whether (and how) it uses indexes.

Bad estimates can sometimes be quickly resolved by manually running ANALYZE on the tables involved. If this helps, then you should look into preventing recurrences by ensuring ANALYZE runs more regularly going forwards (eg via autovacuum).

If one of the columns involved has a skewed distribution you may see better estimates after increasing the statistics target.
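
A sketch of both steps, assuming a table called orders with a skewed status column (the names and the target of 1000 are placeholders):

analyze orders;

alter table orders alter column status set statistics 1000;
analyze orders;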

Another common source of bad estimates is that Postgres assumes that two columns are independent by default. If that’s the root cause, we can ask it to gather data on the correlation between two columns (in the same table, at least) via multivariate statistics.
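
For example, functional-dependency statistics on two correlated columns might look like this (a sketch; the table and column names are placeholders):

create statistics orders_city_zip (dependencies) on city, zip_code from orders;
analyze orders;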

If this section felt a little rushed, you might like our more in depth post dedicated to bad row count estimates.

Suboptimal configuration

Last, but certainly not least, we come to the planner cost constants. These are used by the query planner to determine the cost of each operation, and subsequently the query plan as a whole.

Cost configuration is a big topic, so we’ll take a look at the basics and a couple of resources to learn more.

On the topic of index use, if you’re still seeing a Seq Scan instead of the Index Scan that would be faster, then a good parameter to check is random_page_cost:

show random_page_cost;

By default, this is set to 4.0, which is 4x higher than the seq_page_cost of 1.0. This ratio made a lot of sense for spinning disks, but assuming your system has SSDs, you may wish to look into a lower setting. It is not uncommon to see modern PostgreSQL setups with random_page_cost set to 2.0, 1.5, or even 1.1, but as ever, please do test carefully before changing this on production.

To test whether it helps, you can change a parameter like this in a session and rerun EXPLAIN to see whether it results in a different query plan being chosen.
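
For example (a sketch; the setting only lasts for the current session, and 1.1 is just one of the commonly used values mentioned above):

set random_page_cost = 1.1;
explain select * from t where email = 'michael@pgmustard.com';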

Another common cost-related reason for Postgres favouring a Seq Scan over an Index Scan is if it is parallelizing operations. To quickly test this, you can disable parallel execution locally and re-request the query plan, for example via:

set max_parallel_workers_per_gather = 0;

There are many other cost-related configuration parameters that you may find beneficial to tune. The query planning section of the PostgreSQL docs contains a useful list, and you can then look up each of them on the postgresqlco.nf website, which includes advice, courtesy of Josh Berkus’ excellent annotated.conf, as well as useful information and links courtesy of the team at Ongres.

In summary

So in summary, our first step was to check whether Postgres could use the index.

If it couldn’t, we double-checked it actually existed, before seeing whether a function or datatype mismatch could be the problem.

If it could, we compared the execution time to make sure it wasn’t right all along, then looked into the row count estimates and planner cost constants, looking for ways to help it calculate the relative costs more accurately.

Next steps

If, after this, you’re still facing an issue, both the PostgreSQL Wiki and Stack Overflow have good guides for troubleshooting performance problems, asking for help, and reporting bugs.

I hope it’s been somewhat helpful though, and it’d be great to hear from you either way. I’d particularly welcome hearing about things I could improve, for clarity or completeness.


If you’d like a monthly email with Postgres performance blog posts and videos, as well as any pgMustard news, check out our newsletter.


Paul Brebner: Using Apache Superset to Visualize PostgreSQL JSON Data (Pipeline Series Part 7)


In the last installment of the pipeline blog series, we explored writing streaming JSON data into PostgreSQL using Kafka Connect. In this blog, it’s time to find out if our fears of encountering mutant monsters in the gloom of an abandoned uranium mine were warranted or not.

1. What Is Apache Superset?

What is Apache Superset? Well, it could be a set of (possibly mutant) superheroes (supposedly, many superheroes got their powers from mutations, and biological mutations are actually normal and not a byproduct of radiation)?

(Source: Shutterstock)

Or perhaps it could be related to a mathematical superset? B is a superset of A if B contains all the elements of A (B ⊇ A). For example, if A is the set of all superheroes that use weapons, and B is the set of all superheroes, then B is the superset of A:

However, my guess is that the mathematical idea inspired Apache Superset as the project page proclaims that:

“Apache Superset is a modern data exploration and visualization platform that is fast, lightweight, intuitive, and loaded with options that make it easy for users of all skill sets to explore and visualize their data, from simple line charts to highly detailed geospatial charts.”

Here are some interesting points I discovered about Superset:

  • It is based on SQL and can connect to any SQL database that is supported by SQL Alchemy. So at least in theory, this gives good interoperability with a large number of data sources (but also excludes some common NoSQL databases, unfortunately). 
  • But what if you aren’t an SQL guru? It supports a zero-code visualization builder, but also a more detailed SQL Editor. 
  • Superset supports more than 50 built-in charts offering plenty of choices for how you visualize your data. 
  • The main Superset concepts are databases, data sources, charts, and dashboards. 
  • It’s written in Javascript and Python.
  • The focus is on “BI” (Business Intelligence) rather than machine metrics (cf. Prometheus/Grafana) or scientific data visualization (cf. https://matplotlib.org/, which helped produce the first photograph of a black hole).
  • Superset started out in Airbnb, and graduated to being a top-level Apache project this year (2021), so it’s timely to be evaluating it now. 

In this blog, I wanted to get Apache Superset working with PostgreSQL. Some of the challenges I expected were:

  1. How easy is it to deploy Apache Superset?
  2. Can Apache Superset be connected to PostgreSQL (specifically, a managed Instaclustr PostgreSQL service, I had access to a preview service)?
  3. Is it possible to chart JSONB column data types in Apache Superset?
  4. Does Apache Superset have appropriate chart types to display the basic NOAA tidal data and also map the locations?

2. Deploying Apache Superset

So far in the pipeline series I’ve been used to the ease of spinning up managed services on the Instaclustr platform (e.g. Elasticsearch and Kibana). However, we currently don’t have Superset on our platform, so I was going to have to install and run it on my laptop (a Mac).  I’m also used to the ease of deploying Java-based applications (get a jar, run jar). However, Superset is written in Python, which I have no prior experience with. I, therefore, tried two approaches to deploying it.

Approach 1

I initially tried to install Superset from scratch using “pip install apache-superset”. However, this failed with too many errors.

Approach 2

My second approach used Docker, which eventually (it was very slow to build) worked. I just used the built-in Mac Docker environment which worked fine—but don’t forget to increase the available memory from the default of 2GB (which soon fails) to at least 6GB. It’s also very resource-hungry while running, using 50% of the available CPU.

Superset runs in a browser at  http://localhost:8088

Don’t forget to change the default user/password from admin/admin to something else.

So, given our initial false start, and by using Docker, it was relatively easy to deploy and run Superset. 

3. Connecting Superset to PostgreSQL

In Superset, if you go into “Data->Databases” you find that there’s already a PostgreSQL database and drivers installed and running in the Docker version. It turns out that Superset uses PostgreSQL as the default metadata store (and for the examples). 

But I wanted to use an external PostgreSQL database, and that was easy to configure on Superset following this documentation.

The PostgreSQL URL format to connect to a fully managed Instaclustr enterprise-grade service for PostgreSQL looks like this (with the pguser, pgpassword and pgip address obtained from the Instaclustr Console):

postgresql://pguser:pgpassword@pgip:5432/postgresl

Superset has the concepts of databases and datasets. After connecting to the Instaclustr managed PostgreSQL database and testing the connection you “Add” it, and then a new database will appear. After configuring a database you can add a dataset, by selecting a database, schema, and table.

So, the second challenge was also easy, and we are now connected to the external PostgreSQL database.

4. Apache Superset Visualization Types

Clicking on a new dataset takes you to the Chart Builder, which allows you to select the visualization type, select the time (x-axis) column (including time grain and range), and modify the query. By default, this is the “Explore” view which is the zero-code chart builder. There’s also the SQL Lab option which allows you to view and modify the SQL query.

Here are the available visualization types (which I noticed include my favorite graph type, the Sankey diagram, which is great for systems performance analysis; Chord diagrams are also cool):

6. Charting PostgreSQL Column Data With Apache Superset

Next, I wanted to check that I could actually get data out of PostgreSQL and chart it, so I cheated (a bit) and uploaded a year’s worth of NOAA tidal data for one location using the CSV option (“Data->Upload a CSV”), which creates a table with columns in PostgreSQL, and a new corresponding dataset in Superset. From the new dataset, I created a Time-series Chart, with time as the x-axis, Day as the Time Grain, and MAX(value) as the metric. This gives the following graph showing the maximum tide height per day, so that ticks another box.

7. Charting PostgreSQL JSONB Data With  Apache Superset

The real challenge was to chart the NOAA data that was streaming into my PostgreSQL database as a result of the previous blog, and which was being stored as a JSONB data type.

I created a dataset for the NOAA JSON table and tried creating a chart. Unfortunately, the automatic query builder interface didn’t have any success interpreting the JSONB data type, it just thought it was a String. So after some Googling, I found a workaround involving “virtual” datasets (JSON or Virtual datasets don’t seem to feature in the Apache documentation, unfortunately).

However, if you look at “Data->Datasets” you’ll notice that each dataset has a Type, either Physical or Virtual. Datasets are Physical by default, but you can create Virtual datasets in the SQL Lab Editor. Basically, this allows you to write your own SQL resulting in a new “virtual” table/dataset. What I needed to do was to create a SQL query that reads from the NOAA JSONB data type table and creates a new table with the columns that I need for charting, including (initially at least) the station name, time, and value. 

Here’s an example of the JSON NOAA data structure:

{"metadata": {
   "id":"8724580",
   "name":"Key West",
   "lat":"24.5508”,
   "lon":"-81.8081"},
"data":[{
   "t":"2020-09-24 04:18",
   "v":"0.597",
      "s":"0.005", "f":"1,0,0,0", "q":"p"}]}

Again, after further googling and reading my blog on PostgreSQL JSON, and testing it out in the SQL Editor (you can see the results), I came up with the following SQL that did the trick:

SELECT
cast(d.item_object->>'t' as timestamp) AS __timestamp, json_object->'metadata'->'name' AS name, AVG(cast(d.item_object->>'v' as FLOAT)) AS v
FROM
tides_jsonb,jsonb_array_elements(json_object->'data')
with ordinality d(item_object, position)
where d.position=1
GROUP BY name, __timestamp;

Compared with the previous approach using Elasticsearch, this SQL achieves something similar to the custom mapping I used which overrides the default data type mappings. The difference is that using PostgreSQL you have to manually extract the correct JSON elements for each desired output column (using the “->>” and “->” operators), and cast them to the desired types. The functions “jsonb_array_elements” and “ordinality” are required to extract the first (and only) element from the ‘data’ array (into d), which is used to get the ‘v’ and ‘t’ fields.

Once you’ve got this working, you click on the Explore button and “Save or Overwrite Dataset” to save the query as a virtual dataset. The new dataset (with Type Virtual in the Datasets view) is now visible, and you can create a Time-series Chart for it just as we did with the sample (CSV) column table. To display multiple graphs on one chart you also have to Group by name. The (automatically generated) SQL is pretty complex, but I noticed that it’s just nesting the above virtual query in a “FROM” clause to get the virtual dataset and graph it. 

Here’s the chart with data from 2 stations:

So far, so good, now let’s tackle the mapping task.

8. Mapping PostgreSQL JSONB Data

The final task I had set myself was to attempt to replicate displaying station locations on a map, which I’d previously succeeded in doing with Elasticsearch and Kibana.

The prerequisite is to add the lat/lon JSON elements to a virtual dataset as follows:

SELECT
cast(d.item_object->>'t' as timestamp) AS __timestamp, json_object->'metadata'->'name' AS name, 
cast(json_object->'metadata'->>'lat' as FLOAT) AS lat,
cast(json_object->'metadata'->>'lon' as FLOAT) AS lon, 
AVG(cast(d.item_object->>'v' as FLOAT)) AS v
FROM
tides_jsonb,jsonb_array_elements(json_object->'data')
with ordinality d(item_object, position)
where d.position=1
GROUP BY name, __timestamp, lat, lon

Unlike Elasticsearch/Kibana which needed a mapping to a special geospatial data type (geo_point) to work, Superset is happy with a FLOAT type. 

The next step is to find a map visualization type. Typing “map” into the visualization type search box gives these results:

As you can see, only a few of these are actual location-based maps, and they turned out to be mainly maps for country data. I also searched for “geo” and found one map called “deck.gl Geojson”, which looked more hopeful, and searching for “deck” I found a bunch of potential candidates:

I found out that deck.gl is a framework for detailed geospatial charts so these looked promising. However, after trying a few I discovered that they didn’t have any base maps displayed. After some frantic googling, I also discovered that you need a “Mapbox” token to get them to work. You can get a free one from here. To apply the Mapbox token you have to stop the Superset Docker, go to the Superset/Docker folder, edit the non-dev  config file (.env-non-dev) and add a line “MAPBOX_API_KEY = “magic number”, save it, and start superset docker again—and then the base maps will appear.

The final problem was selecting an appropriate chart type. This turned out to be trickier than I thought, as many of them have specific use cases (e.g. some automatically aggregate or cluster points). For example, the MapBox visualization successfully shows the location of the NOAA tidal stations, but at the zoomed-out level, it aggregates them and displays a count.  Note that for map charts, you have to select the Longitude and Latitude columns; but watch out as this is the reverse order to the normal lat/lon order convention.

Finally the deck.gl Scatterplot visualization type (with satellite map style) succeeded in showing the location of each station (small blue dots around coastline).

But what I really wanted to do was show a tidal value for each station location as I’d done with Kibana previously. Luckily I worked out that you can also change the size and color of the points as well. I uploaded the NOAA sea level trends (mm/year), configured point size to be based on the value with a multiplier of 10 to make them bigger, and selected Point Color->Adaptative formatting, which changes the point color based on the point size metric (unfortunately the colour gradient isn’t meaningful). Zooming in on the map you can see what this looks like in more detail, and you can easily see which tidal stations have bigger sea level trends (maybe best not to buy coastal properties there!)

9. Conclusions

How does the PostgreSQL+Superset approach compare with the previous Elasticsearch+Kibana approach?

The main difference I noticed in using them was that in Elasticsearch the JSON processing is performed at indexing time with custom mappings, whereas for Superset the transformation is done at query time using SQL and JSONB operators to create a virtual dataset.

In terms of interoperability, Kibana is limited to use with Elasticsearch, and Superset can be used with any SQL database (and possibly with Elasticsearch, although I haven’t tried this yet). Superset has more chart types than Kibana, although Kibana has multiple plugins (which may not be open source and/or work with Open Distro/OpenSearch).

There may also be scalability differences between the two approaches. For example, I encountered a few scalability “speed-humps” with Elasticsearch, so a performance comparison may be interesting in the future. 

Follow the Pipeline Series

  • Part 1: Building a Real-Time Tide Data Processing Pipeline: Using Apache Kafka, Kafka Connect, Elasticsearch, and Kibana
  • Part 2: Building a Real-Time Tide Data Processing Pipeline: Using Apache Kafka, Kafka Connect, Elasticsearch, and Kibana
  • Part 3: Getting to Know Apache Camel Kafka Connectors
  • Part 4: Monitoring Kafka Connect Pipeline Metrics with Prometheus
  • Part 5: Scaling Kafka Connect Streaming Data Processing
  • Part 6: Streaming JSON Data Into PostgreSQL Using Open Source Kafka Sink Connectors
  • Part 7: Using Apache Superset to Visualize PostgreSQL JSON Data

The post Using Apache Superset to Visualize PostgreSQL JSON Data (Pipeline Series Part 7) appeared first on Instaclustr.

Franck Pachot: The cost and benefit of synchronous replication in PostgreSQL and YugabyteDB


I have seen people comparing YugabyteDB and PostgreSQL, and being surprised by the different throughput when running a simple test from a single session. The purpose of a distributed database is to scale out. When running on a single node without the need for High-Availability-without-data-loss (this is a tautology), a monolith database will always perform with lower latency, because a distributed DB is designed to ensure persistence (the D in ACID) through RPC (remote procedure calls) rather than local writes.

Here is a simple workload:

drop table if exists demo;
create table demo(
 i int primary key,
 t timestamp default clock_timestamp()
);

\timing on

do $$ 
 begin
  truncate demo; 
  for i in 1..1e4 loop
   insert into demo(i) values(i);
   commit; 
  end loop; 
 end; 
$$;

select
 count(*)/extract(epoch from max(t)-min(t)) "rows/s",
 count(*),max(t)-min(t) "duration"
 from demo;

YugabyteDB

Here is the run in the current production (stable) release, an RF=3 cluster with all nodes on the same VM (for this test, in order to be independent of network latency; you don't do that in production):

[postgres@yb0 ~]$ psql -p 5433

yugabyt=# select version();
                                                  version
------------------------------------------------------------------------------------------------------------
 PostgreSQL 11.2-YB-2.6.1.0-b0 on x86_64-pc-linux-gnu, compiled by gcc (Homebrew gcc 5.5.0_4) 5.5.0, 64-bit

yugabyte=# do $$ begin truncate demo; for i in 1..1e4 loop insert into demo(i) values(i); commit; end loop; end; $$;
DO
Time: 37130.036 ms (00:37.130)

yugabyte=# select count(*)/extract(epoch from max(t)-min(t)) "rows/s",count(*),max(t)-min(t) "duration" from demo;

      rows/s      | count |    duration
------------------+-------+-----------------
 270.115078207229 | 10000 | 00:00:37.021258

The number by itself is not important. It is a lab on one VM, but I'll run everything on the same machine to compare the throughput.

PostgreSQL no HA

I'm starting PostgreSQL on the same server:

/usr/pgsql-13/bin/initdb -D /var/tmp/pgdata
echo "port=5432" >> /var/tmp/pgdata/postgresql.conf
/usr/pgsql-13/bin/pg_ctl -D /var/tmp/pgdata -l logfile start

And run the same:

[postgres@yb0 ~]$ psql -p 5432

postgres=# select version();

                                                version
-------------------------------------------------------------------------------------------------------------
 PostgreSQL 13.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1), 64-bit

postgres=# do $$ begin truncate demo; for i in 1..1e4 loop insert into demo(i) values(i); commit; end loop; end; $$;
DO
Time: 5533.086 ms (00:05.533)

postgres=# select count(*)/extract(epoch from max(t)-min(t)) "rows/s",count(*),max(t)-min(t) "duration" from demo;

       rows/s       | count |    duration
-------------------------+-------+-----------------
 1809.0359900474075 | 10000 | 00:00:05.527806

This is what makes you think that PostgreSQL is faster. Yes, there is a 1:7 factor here in transactions per second.

But we are comparing apples and oranges in terms of resilience. YugabyteDB was running with Replication Factor RF=3, so that each write is propagated to a quorum of 2 out of 3 replicas. In a YugabyteDB cluster with RF=3, you can kill a node and:

  • 2/3 of reads and writes continue to operate as if nothing happens. Thanks to the sharding of tables into tablets.
  • 1/3 of reads and writes, those which had their leader on the dead node, have to wait a few seconds to get one of the followers, on the surviving nodes, to be elected new leader (Raft protocol)

And all continues because we have the quorum. And all is consistent. And no committed transaction has been lost. The only consequence is that, until the first node is back, or a new node is added, losing a second node will stop the database. Still with no data loss. But RF=3 can tolerate only one node down, by definition.

This protection involves remote procedure calls. Let's see how PostgreSQL would behave with some higher availability.

PostgreSQL with standby

I'll add two standby databases to my PostgreSQL cluster:

/usr/pgsql-13/bin/pg_basebackup -p 5432 -D /var/tmp/pgsby1 -R --slot=sby1 -C
echo "port=5441" >> /var/tmp/pgsby1/postgresql.conf
touch /var/tmp/pgsby1/recovery.signal
/usr/pgsql-13/bin/pg_ctl -D /var/tmp/pgsby1 -l logfile start

/usr/pgsql-13/bin/pg_basebackup -p 5432 -D /var/tmp/pgsby2 -R --slot=sby2 -C
echo "port=5442" >> /var/tmp/pgsby2/postgresql.conf
touch /var/tmp/pgsby2/recovery.signal
/usr/pgsql-13/bin/pg_ctl -D /var/tmp/pgsby2 -l logfile start

Here they are visible from the primary:

[postgres@yb0 ~]$ psql -p 5432 -c "select * from pg_replication_slots"
 slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size
----------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------
 sby1      |        | physical  |        |          | f         | t      |     298963 |      |              | 0/5000060   |                     | reserved   |
 sby2      |        | physical  |        |          | f         | t      |     299054 |      |              | 0/5000060   |                     | reserved   |
(2 rows)

[postgres@yb0 ~]$ psql -p 5432 -c "select * from pg_stat_replication"
  pid   | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         | backend_xmin |   state   | sent_lsn  | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_la
g | sync_priority | sync_state |          reply_time
-------------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+-----------+-----------+-----------+------------+-----------+-----------+----------
--+---------------+------------+-------------------------------
 298963 |       10 | postgres | walreceiver      |             |                 |          -1 | 2021-09-28 13:03:45.30363+00  |              | streaming | 0/5000060 | 0/5000060 | 0/5000060 | 0/5000060  |           |           |
  |             0 | async      | 2021-09-28 13:04:23.662693+00
 299054 |       10 | postgres | walreceiver      |             |                 |          -1 | 2021-09-28 13:03:48.668362+00 |              | streaming | 0/5000060 | 0/5000060 | 0/5000060 | 0/5000060  |           |           |
  |             0 | async      | 2021-09-28 13:04:23.663322+00
(2 rows)

Those are two asynchronous standbys receiving the streamed WAL.

I run the same mini-workload:

postgres=# do $$ begin truncate demo; for i in 1..1e4 loop insert into demo(i) values(i); commit; end loop; end; $$;
select count(*)/extract(epoch from max(t)-min(t)) "rows/s",count(*),max(t)-min(t) "duration" from demo;
DO
Time: 6437.456 ms (00:06.437)

postgres=# select count(*)/extract(epoch from max(t)-min(t)) "rows/s",count(*),max(t)-min(t) "duration" from demo;

       rows/s       | count |    duration
-------------------------+-------+-----------------
 1554.5772428235664 | 10000 | 00:00:06.432617

This is still quite fast. But is this High Availability? Not at all. Yes, in case of total failure of the primary database, I don't need to restore a backup and can fail over to one of the standby databases. But:

  • I'll lose some committed transactions because I'm in ASYNC replication. Recovery Point Objective is RPO>0
  • Because of the preceding, this cannot be automated. You need a human decision to evaluate the risk of data loss, and the probability to get the failed primary site back at least to get the WAL with the latest transactions, before opening the standby. Human decision means, in practice, a Recovery Time Objective in minutes or hours: RTO>0

This cannot be compared with YugabyteDB replication where all is automated within seconds, without data loss.

PostgreSQL with synchronous standby

We can reduce the RPO with synchronous replication:

echo "synchronous_standby_names = '*'" >> /var/tmp/pgdata/postgresql.conf
/usr/pgsql-13/bin/pg_ctl -D /var/tmp/pgdata reload

This configuration ensures that one standby has received the WAL to cover the transaction before we return a successful commit to the user:

[postgres@yb0 ~]$ psql -p 5432 -c "select * from pg_stat_replication"

  pid   | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         | backend_xmin |   state   | sent_lsn  | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_la
g | sync_priority | sync_state |          reply_time
-------------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+-----------+-----------+-----------+------------+-----------+-----------+----------
--+---------------+------------+-------------------------------

 298963 |       10 | postgres | walreceiver      |             |                 |          -1 | 2021-09-28 13:03:45.30363+00  |              | streaming | 0/538E3F0 | 0/538E3F0 | 0/538E3F0 | 0/538E3F0  |           |           |
  |             1 | sync       | 2021-09-28 13:14:12.307231+00
 299054 |       10 | postgres | walreceiver      |             |                 |          -1 | 2021-09-28 13:03:48.668362+00 |              | streaming | 0/538E3F0 | 0/538E3F0 | 0/538E3F0 | 0/538E3F0  |           |           |
  |             1 | potential  | 2021-09-28 13:14:12.307294+00
(2 rows)

Now, if I run my small workload again:

postgres=# do $$ begin truncate demo; for i in 1..1e4 loop insert into demo(i) values(i); commit; end loop; end; $$;
DO
Time: 13613.487 ms (00:13.613)

postgres=# select count(*)/extract(epoch from max(t)-min(t)) "rows/s",count(*),max(t)-min(t) "duration" from demo;

      rows/s      | count |    duration
-----------------------+-------+-----------------
 734.861683966413 | 10000 | 00:00:13.608003

The throughput has been divided by two.

Are we in High Availability here? This SYNC configuration requires complex monitoring and management. Because, even in sync, the persistence of the WAL does not happen at the same time everywhere: it is first written (and fsync'd) locally, then sent (and acknowledged) to the standby, and then "commit successful" is returned to the user. There is no two-phase commit here. This is very different, in case of failure, from a consensus protocol as we can find in a distributed database. PostgreSQL databases are often used with ASYNC, and this is a very good DR (Disaster Recovery) solution where the data loss is minimal after a manual failover. SYNC replication is possible, but it doesn't qualify as the same High Availability as distributed databases provide.

The numbers are not important here. They will depend on your machine and your network. Distributed databases can be in sync in a multi-AZ cluster, even multi-region. The point is that the throughput is lower for a single session. But, because all nodes are active, this scales out when multiple sessions are load-balanced over all nodes. You cannot do that with PostgreSQL standbys, which are read-only.

YugabyteDB scale-out

I'm adding a "j" column for job number in my table:

drop table if exists demo;
create table demo(
 j int, i int,
 t timestamp default clock_timestamp(),
 primary key(j,i)
);

And run 3 parallel jobs doing the same inserts:

for i in {0..2} ; do
psql -h yb$i -p 5433 -c 'do $$ begin for i in 1..1e4 loop insert into demo(j,i) values('$i',i); commit; end loop; end; $$ ; ' &
done ; wait

Here is the result:

postgres=# select j, count(*)/extract(epoch from max(t)-min(t)) "rows/s",count(*),max(t)-min(t) "duration"
           from demo group by j;

 j |      rows/s      | count |    duration
--------+------------------+-------+-----------------
 0 | 181.329577731828 | 10000 | 00:00:55.148201
 2 | 180.088033513951 | 10000 | 00:00:55.528398
 1 | 188.044672569376 | 10000 | 00:00:53.178853
(3 rows)

With my 3 concurrent sessions I have inserted at 550 transactions per second. Again, this is a small lab. While single-session short transactions have a limited rate because of the distributed nature of commits, the workload can scale to many nodes. If you stay on one VM without synchronous replication to another site, PostgreSQL will be faster. Where distributed databases show all their power is when you add nodes, for high availability and load balancing, without adding complexity, because all the distributed protocol is already there.

There are additional things that you can't see in this short test. PostgreSQL cannot sustain those inserts indefinitely. The shared buffers are filling up, checkpoints will occur, and the filesystem cache will be synced to disk. And most important: at some point you will need to VACUUM the table before the transaction ID wraps around, or the database will hang. The first minutes of inserts are very optimistic in PostgreSQL, which is fine for short peaks of activity.

Note that I've written something similar in the past about RDS PostgreSQL vs. Aurora. Even if both cannot scale out the writes, the HA in Aurora relies on remote WAL sync.

Too many parameters to consider? Don't panic. Because YugabyteDB has the same API as PostgreSQL - it uses the same SQL and PL/pgSQL layer, and a similar open source license - you are not locked in by your initial decision. You can start with PostgreSQL and scale with YugabyteDB, or vice-versa.

Pavel Stehule: plpgsql_check 2.0.1

I released a new significant version of plpgsql_check: plpgsql_check 2.0.1. Although there are only two new features (and a few bugfixes), these two features are important. I wrote about the benefits of plpgsql_check for PL/pgSQL developers in my blog post Why you need plpgsql_check (if you write procedures in PLpgSQL). plpgsql_check is a PostgreSQL extension that does static analysis of PL/pgSQL code. It can detect a lot of possible runtime bugs before execution, and it can detect some performance or security issues too. Moreover, plpgsql_check can do coverage analysis, and it has an integrated profiler and tracer. PL/pgSQL is a relatively static, type-strict language, so static analysis works well. But there are two limits: static analysis cannot work with objects and values that are created (calculated) at runtime. These objects are local temporary tables (PostgreSQL doesn't support global temporary tables yet) and the results of dynamic SQL:

postgres=# \sf+ fx1
CREATE OR REPLACE FUNCTION public.fx1(tablename text)
RETURNS void
LANGUAGE plpgsql
1 AS $function$
2 DECLARE r record;
3 BEGIN
4 EXECUTE format('SELECT * FROM %I', tablename) INTO r;
5 RAISE NOTICE 'id=%', r.id;
6 END;
7 $function$

postgres=# SELECT * FROM plpgsql_check_function('fx1');
┌──────────────────────────────────────────────────────────────────────────────────────┐
│ plpgsql_check_function │
╞══════════════════════════════════════════════════════════════════════════════════════╡
│ warning:00000:4:EXECUTE:cannot determinate a result of dynamic SQL │
│ Detail: There is a risk of related false alarms. │
│ Hint: Don't use dynamic SQL and record type together, when you would check function. │
│ error:55000:5:RAISE:record "r" is not assigned yet │
│ Detail: The tuple structure of a not-yet-assigned record is indeterminate. │
│ Context: SQL expression "r.id" │
└──────────────────────────────────────────────────────────────────────────────────────┘
(6 rows)

postgres=# \sf+ fx2
CREATE OR REPLACE FUNCTION public.fx2()
RETURNS void
LANGUAGE plpgsql
1 AS $function$
2 BEGIN
3 CREATE TEMP TABLE IF NOT EXISTS ltt(a int);
4 DELETE FROM ltt;
5 INSERT INTO ltt VALUES(10);
6 END;
7 $function$

postgres=# SELECT * FROM plpgsql_check_function('fx2');
┌───────────────────────────────────────────────────────────┐
│ plpgsql_check_function │
╞═══════════════════════════════════════════════════════════╡
│ error:42P01:4:SQL statement:relation "ltt" does not exist │
│ Query: DELETE FROM ltt │
│ -- ^ │
└───────────────────────────────────────────────────────────┘
(3 rows)
In plpgsql_check 2.0.0 I can use the pragmas TYPE and TABLE (note: the syntax of a pragma in plpgsql_check is a little bit strange, because PL/pgSQL doesn't support a native syntax for pragmas (custom compiler directives) like the Ada or PL/SQL languages do):

CREATE OR REPLACE FUNCTION public.fx1(tablename text)
RETURNS void
LANGUAGE plpgsql
1 AS $function$
2 DECLARE r record;
3 BEGIN
4 PERFORM plpgsql_check_pragma('TYPE: r (id int)');
5 EXECUTE format('SELECT * FROM %I', tablename) INTO r;
6 RAISE NOTICE 'id=%', r.id;
7 END;
8 $function$

postgres=# SELECT * FROM plpgsql_check_function('fx1');
┌────────────────────────┐
│ plpgsql_check_function │
╞════════════════════════╡
└────────────────────────┘
(0 rows)

postgres=# \sf+ fx2
CREATE OR REPLACE FUNCTION public.fx2()
RETURNS void
LANGUAGE plpgsql
1 AS $function$
2 BEGIN
3 CREATE TEMP TABLE IF NOT EXISTS ltt(a int);
4 PERFORM plpgsql_check_pragma('TABLE: ltt (a int)');
5 DELETE FROM ltt;
6 INSERT INTO ltt VALUES(10);
7 END;
8 $function$

postgres=# SELECT * FROM plpgsql_check_function('fx2');
┌────────────────────────┐
│ plpgsql_check_function │
╞════════════════════════╡
└────────────────────────┘
(0 rows)

Franck Pachot: Bulk load into PostgreSQL / YugabyteDB - psycopg2


Fast load into a database table is a feature we need on any database. Data warehouses use it daily. OLTP requires it regularly. And probably from the beginning of their existence, to migrate data into it. With Python, the psycopg2 client is commonly used and I'll start there (this is the first post in a series). The main motivation is that psycopg2 doesn't have prepared statements, and parsing each INSERT, even with a list of rows, is not efficient for loading millions of rows. But it has a nice alternative as it can call the COPY command.

All this works on any PostgreSQL-compatible engine. I'll use YugabyteDB here because loading data involves multiple nodes, and it is then critical to do it right. I'll simulate an IoT ingest taking the function from the Database performance comparison for IoT use cases. This project uses INSERT ... VALUES to load data but I'll use COPY which is the most efficient for any PostgreSQL compatible database that supports it. I've submitted a PR with some optimizations (now merged).

I'll define the IoT events with dataclasses_json and will use the psycopg2 driver, which is the most frequently used with Python. This requires:

pip install psycopg2 dataclasses_json

I re-use the generate_events code from the MaibornWolff project:

from datetime import datetime
import random
import string
from dataclasses import dataclass
from dataclasses_json import dataclass_json
from datetime import datetime, timezone


@dataclass_json
@dataclass
class Event:
    timestamp: datetime
    device_id: str
    sequence_number: str
    temperature: float


def _rand_string(length):
    return ''.join(random.choices(string.ascii_lowercase + string.digits, k=length))


def generate_events(device_id, start_timestamp, num_events, sequence_number=1, device_spread=1):
    device_ids = [f"{_rand_string(4)}{device_id}{_rand_string(4)}" for i in range(device_spread)]
    if start_timestamp == 0:
        start_timestamp = int(datetime.now().timestamp() * 1000)
    for i in range(num_events):
        event = Event(datetime.fromtimestamp(start_timestamp / 1000, timezone.utc),
                      device_ids[i % device_spread],
                      sequence_number,
                      random.uniform(-20, 35))
        yield event
        sequence_number += 1
        start_timestamp += random.randint(5, 10) * 60

The only thing I changed here is to define a timestamp datatype rather than an epoch.

I create an iot_demo table to store those events:

import psycopg2

yb = psycopg2.connect('postgres://yugabyte:franck@yb1.pachot.net:5433/yugabyte')
yb.cursor().execute("""
        drop table if exists iot_demo;
        create table if not exists iot_demo (
            timestamp timestamp,
            device_id text,
            sequence_number bigint,
            temperature real,
            primary key(device_id,timestamp,sequence_number) 
            );
""")

This connects to my very small Yugabyte database that I keep publicly open. You can easily create one following the documentation or our free tier cloud.

I have defined the primary key without mentioning HASH or RANGE partitioning so that the code is compatible with any PostgreSQL database. In YugabyteDB, the defaults are HASH for the first column in the primary key and ASC for the others, so what I want to achieve is: hash sharding to distribute the devices and range on time to get the measures together.
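
For reference, here is a sketch of the YugabyteDB-specific variant with those defaults made explicit (this syntax will not run on vanilla PostgreSQL, which is why I did not use it):

-- explicit hash sharding on the device, range ordering on time within each device
create table if not exists iot_demo (
    timestamp timestamp,
    device_id text,
    sequence_number bigint,
    temperature real,
    primary key (device_id HASH, timestamp ASC, sequence_number ASC)
);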

As generate_events is a generator (there's a yield clause to return the event in the loop), I just loop on it to build a string version of the events (tab-separated columns).

StringIO

import io
import psycopg2

def load_events_with_copy_psycopg2(device_id, num_events):
    yb = psycopg2.connect('postgres://yugabyte:franck@yb1.pachot.net:5433/yugabyte')
    ysql = yb.cursor()
    ysql.execute("set yb_default_copy_from_rows_per_transaction=1000")
    tsv = io.StringIO()
    for event in generate_events(device_id=device_id, start_timestamp=0,
                                 num_events=num_events, sequence_number=1,
                                 device_spread=1):
        tsv.writelines(f'{event.timestamp}\t{event.device_id}\t{event.sequence_number}\t{event.temperature}\n')
    tsv.seek(0)
    ysql.copy_from(tsv, 'iot_demo', sep="\t",
                   columns=('timestamp', 'device_id', 'sequence_number', 'temperature'))
    tsv.seek(0)
    tsv.truncate(0)
    yb.commit()
    ysql.close()

In psycopg2, copy_from() reads from a file, but here I generate events into an in-memory StringIO that will be read like a file by copy_from(). Don't forget to seek(0) after writing to it so that copy_from() starts at the beginning, and truncate(0) when done.

I'm keeping it simple here. If some character strings contain tabs or newlines, you will have to handle them as required by the TSV format. But in a future post I'll show that you can do better.

Pandas

You can also use pandas to format the CSV correctly and use copy_expert which allows for more options:

import io
import pandas
import psycopg2

def load_events_with_copy_from_psycopg2(device_id, num_events):
    yb = psycopg2.connect('postgres://yugabyte:franck@yb1.pachot.net:5433/yugabyte')
    ysql = yb.cursor()
    ysql.execute("set yb_default_copy_from_rows_per_transaction=1000")
    events = []
    for event in generate_events(device_id=device_id, start_timestamp=0,
                                 num_events=num_events, sequence_number=1,
                                 device_spread=1):
        events.append(event)
    csv = io.StringIO(pandas.DataFrame(events).to_csv(header=True, index=False))
    ysql.copy_expert("""
        COPY iot_demo(timestamp,device_id,sequence_number,temperature)
        FROM STDIN WITH DELIMITER ',' CSV HEADER
        """, csv)
    yb.commit()
    ysql.close()

The one with pandas is easier, as it takes care of the CSV format.

Threads

A distributed database scales, so you may load from multiple threads. This is easy, as psycopg2 is thread-safe.

Here is the main program that loads events in loops in threads:

import threading

def loop_in_thread(device_id, num_loops, num_events):
    t1 = datetime.now().timestamp()
    for l in range(num_loops):
        load_events_with_copy_psycopg2(device_id, num_events)
    t2 = datetime.now().timestamp()
    print(f'{device_id:2d}: {(num_events)/(t2-t1):8.2f} rows/seconds/thread')

threads = []
for i in range(2):
    t = threading.Thread(target=loop_in_thread, args=[i, 10, 10000])
    threads.append(t)
    t.start()
for i in threads:
    i.join()

Here is the result on my tiny database:

[opc@C tmp]$ python3 copy_from.py
 0:   724.47 rows/seconds/thread
 1:   709.02 rows/seconds/thread
 2:   687.75 rows/seconds/thread

The goal of this post was to show that even with this old psycopg2 driver, which doesn't support prepared statements, we can do fast loads. I'll write more on bulk inserts. In PostgreSQL, as in YugabyteDB, COPY is the optimal way to ingest data. The difference is that PostgreSQL by default loads all the data in one transaction, and intermediate commits can be defined with the ROWS_PER_TRANSACTION option. However, in YugabyteDB, the inserted rows have to be buffered to be distributed efficiently to the tablet servers. It is more efficient to limit the size of transactions, and the yb_default_copy_from_rows_per_transaction parameter has been introduced with a default value of 1000 to avoid memory issues when loading large files on small servers. Setting it to 0 reverts to the same behavior as PostgreSQL, in case you want to be sure that everything is rolled back on failure.
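
As a sketch of what the server-side COPY mentioned above could look like (the file path is a hypothetical TSV export, and ROWS_PER_TRANSACTION is the intermediate-commit option discussed in the paragraph above):

COPY iot_demo (timestamp, device_id, sequence_number, temperature)
FROM '/tmp/iot_events.tsv'
WITH (FORMAT text, ROWS_PER_TRANSACTION 1000);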

Anthony Sotolongo León: Boost query performance using Foreign Data Wrapper with minimal changes


Foreign Data Wrapper

Nowadays it is becoming more common for systems and applications to query data from outside the main database. PostgreSQL supports querying external Postgres data using two core extensions, dblink and postgres_fdw. The latter is a Foreign Data Wrapper (FDW), an implementation of the SQL/MED standard, which is part of the ANSI SQL 2003 specification. FDW is widely recommended over dblink for this activity because it provides standards-compliant syntax for accessing remote tables/data and can give us better performance in some cases.

Executing queries that need external data can sometimes be slow, but PostgreSQL's planner can apply some optimizations, such as running part of the work on the remote server to reduce the amount of data transferred, or, where possible, executing JOIN operations remotely to take advantage of the remote server's resources. In addition, it is possible to make adjustments that help the planner make better decisions and help the executor take advantage of them. This blog will show the reader some simple tips, examples, and explanations about how to increase the performance of queries that use FDW in PostgreSQL.

Improving from the definition

Normally, when working with FDW it is necessary to define three types of objects: a SERVER to establish the connections to the remote host, a USER MAPPING that defines which user will access the external host, and finally the FOREIGN TABLE to map the tables on the foreign server. All of these objects have options that can help improve the performance of queries.

1: Fetch_size

The FDW retrieves data from the remote server in batches of rows, and the size of these batches is defined by the fetch_size option (default 100). This option can be specified for a SERVER or a FOREIGN TABLE, and the value defined for the latter takes priority.

Example :

CREATE SERVER pgserver FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'localhost', port '5432', dbname 'db2');

CREATE USER MAPPING FOR current_user SERVER pgserver
  OPTIONS (user 'postgres', password 'my_pass');

CREATE FOREIGN TABLE f_tables.customers (
	customerid int4,
	firstname varchar(50),
	lastname varchar(50),
	username varchar(50),
	age int2,
	gender varchar(1)
) SERVER pgserver OPTIONS (schema_name 'public', table_name 'customers');

If we execute an EXPLAIN command to see what is happening with the query, it is possible to see how the executor implements the Remote SQL:

EXPLAIN (analyze,verbose) select o.*,ftc.firstname from orders o join f_tables.customers as ftc using (customerid)

Hash Join  (cost=470.00..509.92 rows=748 width=148) (actual time=5.960..55.366 rows=12000 loops=1)
  Output: o.orderid, o.orderdate, o.customerid, o.netamount, o.tax, o.totalamount, ftc.firstname
  Hash Cond: (ftc.customerid = o.customerid)
  ->  Foreign Scan on f_tables.customers ftc  (cost=100.00..126.83 rows=561 width=122) (actual time=0.528..41.536 rows=20000 loops=1)
        Output: ftc.customerid, ftc.firstname
        Remote SQL: SELECT customerid, firstname FROM public.customers
  ->  Hash  (cost=220.00..220.00 rows=12000 width=30) (actual time=5.408..5.409 rows=12000 loops=1)
        Output: o.orderid, o.orderdate, o.customerid, o.netamount, o.tax, o.totalamount
        Buckets: 16384  Batches: 1  Memory Usage: 889kB
        ->  Seq Scan on public.orders o  (cost=0.00..220.00 rows=12000 width=30) (actual time=0.014..2.167 rows=12000 loops=1)
              Output: o.orderid, o.orderdate, o.customerid, o.netamount, o.tax, o.totalamount
Planning Time: 0.197 ms
Execution Time: 56.162 ms

With logging properly configured (by log_statement = ‘all’) it will be possible to see what is happening behind the scenes:

0- [18580] postgres@db1 LOG:  execute <unnamed>: explain (analyze,verbose) select o.*,ftc.firstname from orders o join f_tables.customers as ftc using (customerid)
1- [22696] postgres@db2 LOG:  statement: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
2- [22696] postgres@db2 LOG:  execute <unnamed>: DECLARE c1 CURSOR FOR SELECT customerid, firstname FROM public.customers
3- [22696] postgres@db2 LOG:  statement: FETCH 100 FROM c1
   [22696] postgres@db2 LOG:  statement: FETCH 100 FROM c1
   ... until 20000 rows
4- [22696] postgres@db2 LOG:  statement: CLOSE c1
5- [22696] postgres@db2 LOG:  statement: COMMIT TRANSACTION

What has happened?

  1. A transaction began with ISOLATION LEVEL REPEATABLE READ, in a connection to the db2 database.
  2. A cursor was defined with the query to try to fetch the data.
  3. The data was fetched using 100 row bunches.
  4. The cursor was closed.
  5. The transaction was committed.

Therefore, a possible optimization is to increase the value of fetch_size to reduce the number of fetch operations.

ALTER FOREIGN TABLE f_tables.customers OPTIONS (fetch_size '10000');

-- the new plan
Hash Join  (cost=470.00..509.92 rows=748 width=148) (actual time=11.943..33.022 rows=12000 loops=1)
  Output: o.orderid, o.orderdate, o.customerid, o.netamount, o.tax, o.totalamount, ftc.firstname
  Hash Cond: (ftc.customerid = o.customerid)
  ->  Foreign Scan on f_tables.customers ftc  (cost=100.00..126.83 rows=561 width=122) (actual time=9.173..22.924 rows=20000 loops=1)
        Output: ftc.customerid, ftc.firstname
        Remote SQL: SELECT customerid, firstname FROM public.customers
  ->  Hash  (cost=220.00..220.00 rows=12000 width=30) (actual time=2.752..2.752 rows=12000 loops=1)
        Output: o.orderid, o.orderdate, o.customerid, o.netamount, o.tax, o.totalamount
        Buckets: 16384  Batches: 1  Memory Usage: 889kB
        ->  Seq Scan on public.orders o  (cost=0.00..220.00 rows=12000 width=30) (actual time=0.012..1.180 rows=12000 loops=1)
              Output: o.orderid, o.orderdate, o.customerid, o.netamount, o.tax, o.totalamount
Planning Time: 0.131 ms
Execution Time: 33.794 ms

And the logs:

[18580] postgres@db1 LOG:  execute <unnamed>: explain (analyze,verbose) select o.*,ftc.firstname from orders o join f_tables.customers as ftc using (customerid)
[22696] postgres@db2 LOG:  statement: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
[22696] postgres@db2 LOG:  execute <unnamed>: DECLARE c1 CURSOR FOR SELECT customerid, firstname FROM public.customers
[22696] postgres@db2 LOG:  statement: FETCH 10000 FROM c1
... until 20000 rows
[22696] postgres@db2 LOG:  statement: CLOSE c1
[22696] postgres@db2 LOG:  statement: COMMIT TRANSACTION

Obviously, the transaction now requires fewer FETCH operations to get all the data, and this translates into less time spent executing the query, as the EXPLAIN output shows us. So if you need to process a large number of rows through FDW, please take this option into consideration.

2: Extensions

The PostgreSQL planner can decide whether WHERE conditions can be executed on the remote host, as long as it is safe to do so. By default, only WHERE clauses using built-in functions are considered safe, but the extensions option lets us declare our own functions and operators as safe (packaged in extension form). This indicates to PostgreSQL's planner that it is safe to execute those WHERE conditions on the remote host, which translates into better performance. For this, it is necessary to package the functions as an extension and mark them IMMUTABLE. This option can only be specified for SERVER objects.

For example:

EXPLAIN (analyze,verbose) select o.*,ftc.firstname from orders o join f_tables.customers as ftc using (customerid)  where super_age(ftc.age)

    Hash Join  (cost=470.00..642.84rows=249 width=148) (actual time=18.702..59.381rows=525 loops=1)
  Output: o.orderid, o.orderdate, o.customerid, o.netamount, o.tax, o.totalamount, ftc.firstname
  Hash Cond: (ftc.customerid = o.customerid)
  ->Foreign Scan on f_tables.customers ftc  (cost=100.00..268.48rows=187 width=122) (actual time=14.233..54.635rows=842 loops=1)
        Output: ftc.customerid, ftc.firstname, ftc.lastname
        Filter: super_age((ftc.age)::integer)
        Rows Removed by Filter: 19158
        Remote SQL: SELECT customerid, firstname, age FROMpublic.customers
  ->  Hash  (cost=220.00..220.00rows=12000 width=30) (actual time=4.282..4.282rows=12000 loops=1)
        Output: o.orderid, o.orderdate, o.customerid, o.netamount, o.tax, o.totalamount
        Buckets: 16384  Batches: 1  Memory Usage: 889kB
        ->  Seq Scan onpublic.orders o  (cost=0.00..220.00rows=12000 width=30) (actual time=0.022..1.659rows=12000 loops=1)
              Output: o.orderid, o.orderdate, o.customerid, o.netamount, o.tax, o.totalamount
Planning Time: 0.197 ms
Execution Time: 60.013 ms

In this case, if the function super_age is packaged as an extension and the extension's name is set in the extensions option, the result changes:

CREATE EXTENSION fdw_functions; -- the extension must be installed on both servers (origin and remote)
ALTER SERVER pgserver  OPTIONS (extensions 'fdw_functions');
Hash Join  (cost=470.00..635.36 rows=249 width=148) (actual time=39.761..39.985 rows=525 loops=1)
  Output: o.orderid, o.orderdate, o.customerid, o.netamount, o.tax, o.totalamount, ftc.firstname
  Hash Cond: (ftc.customerid = o.customerid)
  ->  Foreign Scan on f_tables.customers ftc  (cost=100.00..261.00 rows=187 width=122) (actual time=37.855..37.891 rows=842 loops=1)
        Output: ftc.customerid, ftc.firstname, ftc.lastname
        Remote SQL: SELECT customerid, firstname FROM public.customers WHERE (public.super_age(age))
  ->  Hash  (cost=220.00..220.00 rows=12000 width=30) (actual time=1.891..1.892 rows=12000 loops=1)
        Output: o.orderid, o.orderdate, o.customerid, o.netamount, o.tax, o.totalamount
        Buckets: 16384  Batches: 1  Memory Usage: 889kB
        ->  Seq Scan on public.orders o  (cost=0.00..220.00 rows=12000 width=30) (actual time=0.009..0.842 rows=12000 loops=1)
              Output: o.orderid, o.orderdate, o.customerid, o.netamount, o.tax, o.totalamount
Planning Time: 0.162 ms
Execution Time: 40.364 ms

Therefore, if you need to filter data in the WHERE clause using your own function, bear in mind this option provided by the FDW.
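
As a minimal sketch of how such a function could be packaged (the body and the age threshold of super_age are assumptions for illustration; the extension also needs a control file and must be installed on both servers):

-- fdw_functions--1.0.sql
CREATE FUNCTION super_age(age integer) RETURNS boolean
AS $$ SELECT age > 90 $$
LANGUAGE sql IMMUTABLE;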

3: Use_remote_estimate

PostgreSQL's planner decides which strategy is best to execute a query based on the table's statistics, so if the statistics are outdated, the planner will choose a bad execution strategy. The autovacuum process helps keep statistics updated, but it does not execute ANALYZE on foreign tables, hence to keep their statistics current it is necessary to run ANALYZE on those tables manually. To avoid outdated statistics, the FDW also has an option to get statistics from the remote server on the fly: this option is use_remote_estimate, and it can be specified for a foreign table or a foreign server.
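
Running ANALYZE on a foreign table by hand is a one-liner, and the option can also be set once for the whole server (a sketch; the server-level command is shown commented out so it does not interfere with the walk-through below):

ANALYZE f_tables.customers;
-- alternatively, enable remote estimates for all tables of the server:
-- ALTER SERVER pgserver OPTIONS (ADD use_remote_estimate 'on');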

Example :

EXPLAIN (analyze,verbose) select o.*,ftc.firstname from orders o join f_tables.customers as ftc using (customerid) where gender = 'F'

Hash Join  (cost=470.00..1212.33 rows=133 width=148) (actual time=38.572..43.333 rows=5947 loops=1)
  Output: o.orderid, o.orderdate, o.customerid, o.netamount, o.tax, o.totalamount, ftc.firstname
  Hash Cond: (ftc.customerid = o.customerid)
  ->  Foreign Scan on f_tables.customers ftc  (cost=100.00..840.00 rows=100 width=122) (actual time=34.704..35.651 rows=10010 loops=1)
        Output: ftc.customerid, ftc.firstname, ftc.lastname
        Remote SQL: SELECT customerid, firstname FROM public.customers WHERE ((gender = 'F'::text))
  ->  Hash  (cost=220.00..220.00 rows=12000 width=30) (actual time=3.846..3.847 rows=12000 loops=1)
        Output: o.orderid, o.orderdate, o.customerid, o.netamount, o.tax, o.totalamount
        Buckets: 16384  Batches: 1  Memory Usage: 889kB
        ->  Seq Scan on public.orders o  (cost=0.00..220.00 rows=12000 width=30) (actual time=0.017..1.606 rows=12000 loops=1)
              Output: o.orderid, o.orderdate, o.customerid, o.netamount, o.tax, o.totalamount
Planning Time: 0.440 ms
Execution Time: 44.203 ms

After setting use_remote_estimate to on, the plan changes:

ALTER FOREIGN TABLE f_tables.customers OPTIONS (use_remote_estimate 'on');

Hash Join  (cost=1163.33..1488.38 rows=6006 width=37) (actual time=13.584..17.607 rows=5947 loops=1)
  Output: o.orderid, o.orderdate, o.customerid, o.netamount, o.tax, o.totalamount, ftc.firstname
  Hash Cond: (o.customerid = ftc.customerid)
  ->  Seq Scan on public.orders o  (cost=0.00..220.00 rows=12000 width=30) (actual time=0.034..1.070 rows=12000 loops=1)
        Output: o.orderid, o.orderdate, o.customerid, o.netamount, o.tax, o.totalamount
  ->  Hash  (cost=1038.20..1038.20 rows=10010 width=11) (actual time=13.507..13.508 rows=10010 loops=1)
        Output: ftc.firstname, ftc.customerid
        Buckets: 16384  Batches: 1  Memory Usage: 559kB
        ->  Foreign Scan on f_tables.customers ftc  (cost=100.00..1038.20 rows=10010 width=11) (actual time=11.719..12.516 rows=10010 loops=1)
              Output: ftc.firstname, ftc.customerid
              Remote SQL: SELECT customerid, firstname FROM public.customers WHERE ((gender = 'F'::text))
Planning Time: 1.173 ms
Execution Time: 17.974 ms

And the logs:

[18580] postgres@db1 LOG:  execute <unnamed>: explain (analyze,verbose) select o.*,ftc.firstname from orders o join f_tables.customers as ftc using (customerid) where gender = 'F'
[22696] postgres@db2 LOG:  statement: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
[22696] postgres@db2 LOG:  statement: EXPLAIN SELECT customerid, firstname FROM public.customers WHERE ((gender = 'F'::text))
[22696] postgres@db2 LOG:  statement: EXPLAIN SELECT customerid, firstname FROM public.customers WHERE ((gender = 'F'::text)) ORDER BY customerid ASC NULLS LAST
[22696] postgres@db2 LOG:  statement: EXPLAIN SELECT customerid, firstname FROM public.customers WHERE ((((SELECT null::integer)::integer) = customerid)) AND ((gender = 'F'::text))
[22696] postgres@db2 LOG:  execute <unnamed>: DECLARE c1 CURSOR FOR SELECT customerid, firstname FROM public.customers WHERE ((gender = 'F'::text))
[22696] postgres@db2 LOG:  statement: FETCH 10000 FROM c1
[22696] postgres@db2 LOG:  statement: FETCH 10000 FROM c1
[22696] postgres@db2 LOG:  statement: CLOSE c1
[22696] postgres@db2 LOG:  statement: COMMIT TRANSACTION

As shown in the logs, several EXPLAIN commands are sent to the remote server, so the planning time obviously increases, but the planner gets better statistics to decide on the best execution plan, which translates into better execution times.

Pool

Another improvement could be to use a pooling solution, like PgBouncer, with static connections for the FDW. This reduces the overhead of establishing new connections each time one is required. Benchmarks about the benefits of using PgBouncer can be found here and there. If you want to test FDW with PgBouncer yourself, you can use these labs, or you can check the results of this test here, where it is possible to see the positive impact of using PgBouncer to pool the FDW connections.
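
A sketch of what pointing the foreign server at a pooler could look like (the host and port are assumptions; 6432 is PgBouncer's default listen port, and session pooling is the safest mode for the cursors that postgres_fdw uses):

CREATE SERVER pgserver_pooled FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'localhost', port '6432', dbname 'db2');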

Conclusions

The tips and examples above show that, with minimal changes, queries that use FDW in PostgreSQL can gain reasonable performance benefits. Of course, you must analyze whether each tip fits your data scenario. In new PostgreSQL releases, new options may appear to boost our queries with FDW. If you know any other tips about FDW performance, please feel free to share them with us.

David Christensen: Devious SQL: Dynamic DDL in PostgreSQL


Supporting PostgreSQL DBAs is an important part of daily life here at Crunchy Data. I’ve recently run across a few use cases where utility queries based on the current state of the database are needed. A simple example could be where you have a table that is the target of logical replication and the id column becomes out of sync with the sequence that generated the data. This would result in new rows having primary key conflicts. To correct this issue, you would need to set the sequence to generate values past the current max value in the table.

This example is part of a larger class of problems which are best solved with functionality that SQL by itself does not directly provide: Dynamic DDL. Data Definition Language (DDL) in SQL itself is notoriously non-dynamic, with strict parsing rules, predefined data types, table structures, and queries based on known and articulated columns.

So how can we bend SQL to our will and execute Dynamic DDL Postgres queries without having to manually write these queries each time? In this next installment of my Devious SQL series (see posts #1 and #2), I’ll show you some SQL approaches to get the job done.

Altering sequence restart values

Let us again consider a scenario where we want to explicitly provide the RESTART value for a sequence via a query. This is an easy thing to express in terms of what we would like to do: we want to reset a sequence to start after the current maximum value of the table it is associated with.

Trying the naïve approach, we get:

ALTER SEQUENCE big_table_id_seq RESTART (SELECT max(id) + 1 FROM big_table);
ERROR:  syntax error at or near "(", at character 41
STATEMENT:  ALTER SEQUENCE big_table_id_seq RESTART (SELECT max(id) + 1 FROM big_table);

As we can see, this approach isn't supported by the PostgreSQL grammar, as it is expecting an actual value here, not a subquery (as nice as that would be).

So what are some approaches here?

Using psql variable substitution

If we are using psql, we have a few options on how to solve this problem. One approach is using psql variables and first selecting the value we want into a variable, then substituting this value into the expression we pass to psql:

-- use \gset to set a psql variable with the results of this query
SELECT max(id) + 1 as big_table_max from big_table \gset
-- substitute the variable in a new query
ALTER SEQUENCE big_table_id_seq RESTART :big_table_max ;
ALTER SEQUENCE

In this example, we are using the \gset command to capture the results of the first query and store it for use later in the psql session. We then interpolate this variable into our expression using the :big_table_max syntax, which will be passed directly to the PostgreSQL server.

Using psql's \gexec command

Another method of utilizing psql for dynamic SQL is constructing the query as a SELECT statement returning the statements you wish to run, then using the \gexec command to execute the underlying queries. First let's look at making ourselves a query that returns the statement we want, then we'll run this statement using \gexec:

SELECT 'ALTER SEQUENCE big_table_id_seq RESTART ' || max(id) + 1 as query FROM big_table;

                     query
------------------------------------------------
 ALTER SEQUENCE big_table_id_seq RESTART 100001

SELECT 'ALTER SEQUENCE big_table_id_seq RESTART ' || max(id) + 1 as query FROM big_table \gexec
ALTER SEQUENCE

A benefit of this approach compared to the variable substitution one is that it can work with more complex statements and multiple return values, so you could construct queries based on arbitrary conditions and generate more than one SQL statement; the first approach is limited to queries that return a single row at a time. This also gives you a preview of the underlying SQL statement before you execute it against the server with \gexec, so it provides some level of safety if you are doing some sort of destructive action in the query.
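
As a minimal sketch of the multi-statement case (the schema filter and the choice of ANALYZE are illustrative assumptions, not part of the original example), you can generate one statement per user table and let \gexec run them all:

-- one generated ANALYZE per table in the public schema; \gexec runs each returned row
SELECT format('ANALYZE %I.%I', schemaname, relname)
FROM pg_stat_user_tables
WHERE schemaname = 'public'
\gexec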

Dynamic SQL without psql

Not everyone uses psql as the interface to PostgreSQL, despite its obvious superiority :-), so are there ways to support dynamic SQL using only server-side tools? As it so happens there are several, using basically the same approach of writing a plpgsql snippet to generate the query, then EXECUTE to run the underlying utility statement. These roughly correlate to the approaches in the psql section above in that they work best for single or multiple dynamic statements.

DO blocks

To use server-side Dynamic SQL we will need to construct our queries using plpgsql and execute the underlying text as if we were issuing the underlying query ourselves.

DO $$
BEGIN
    EXECUTE format('ALTER SEQUENCE big_table_id_seq RESTART %s', (SELECT max(id) + 1 FROM big_table));
END
$$
LANGUAGE plpgsql;
DO

In this case we are using PostgreSQL's built-in format() function which substitutes arguments similar to printf() in C-based languages. This allows us to interpolate the subquery result we were wanting in this case, resulting in a string that PostgreSQL can EXECUTE and giving us the result we want.
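
Note that format() also supports %I for identifiers and %L for literals, which handle quoting for you when names or values come from elsewhere; that is a useful habit for dynamic DDL (the sequence name and comment below are just illustrative):

-- %I quotes an identifier if needed, %L quotes a literal, %s substitutes plain text
SELECT format('ALTER SEQUENCE %I RESTART %s', 'big_table_id_seq', 100001);
--  ALTER SEQUENCE big_table_id_seq RESTART 100001
SELECT format('COMMENT ON SEQUENCE %I IS %L', 'big_table_id_seq', 'reset after restore');
--  COMMENT ON SEQUENCE big_table_id_seq IS 'reset after restore'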

Create an exec() function

Almost identical in function to the DO block, we can also create a simple plpgsql function that simply calls EXECUTE on its input parameter, like so:

CREATE OR REPLACE FUNCTION exec(raw_query text) RETURNS text AS $$
BEGIN
  EXECUTE raw_query;
  RETURN raw_query;
END
$$
LANGUAGE plpgsql;

CREATE FUNCTION

SELECT exec(format('ALTER SEQUENCE big_table_id_seq RESTART %s', (SELECT max(id) + 1 FROM big_table)));

                      exec
------------------------------------------------
 ALTER SEQUENCE big_table_id_seq RESTART 100001

This may seem like a fairly pointless change compared to the previous approach, as we have basically only moved our query into a parameter that we pass in, but what it buys us is the ability to call this function against a list of queries that we construct using normal SQL, giving us the option of running each in turn.

Restrictions

So what type of SQL can be run in each of these sorts of approaches, and are there any restrictions in what we can run via Dynamic SQL with these methods? The main consideration about the different approaches is related to commands that need to be run outside of an explicit transaction block.

Consider if we wanted to run a REINDEX CONCURRENTLY on all known indexes, so we used the exec() approach to construct a REINDEX CONCURRENTLY statement for all indexes in the public schema:

SELECT
    exec(format('REINDEX INDEX CONCURRENTLY %I', relname))
FROM
    pg_class
JOIN
    pg_namespace
ON pg_class.relnamespace = pg_namespace.oid
WHERE
    relkind = 'i' AND
    nspname = 'public'
ERROR:  REINDEX CONCURRENTLY cannot be executed from a function
CONTEXT:  SQL statement "REINDEX INDEX CONCURRENTLY big_table_pkey"
PL/pgSQL function exec(text) line 3 at EXECUTE

As you can see here, this won't work as a function due to REINDEX CONCURRENTLY needing to manage its own transaction state; in PostgreSQL, functions inherently run inside a transaction to allow the impact of a function to either completely succeed or completely fail. (Atomicity in ACID.)

Let's try this using \gexec:

SELECT
    format('REINDEX INDEX CONCURRENTLY %I', relname)
FROM
    pg_class
JOIN
    pg_namespace
ON pg_class.relnamespace = pg_namespace.oid
WHERE
    relkind = 'i' AND
    nspname = 'public'
\gexec
REINDEX

Since the \gexec handling is done by psql, the resulting statement is effectively run at the top-level as if it appeared literally in the SQL file.

More advanced usage

Look for a follow-up blog article where I go into more advanced techniques using Dynamic SQL, particularly using the \gexec command or exec() itself. Until next time, stay devious1!

Footnotes:

1Devious: longer and less direct than the most straightforward way.

Laurenz Albe: Gaps in sequences in PostgreSQL


when gaps in sequences are a real problem
© Laurenz Albe 2021

Most database tables have an artificial numeric primary key, and that number is usually generated automatically using a sequence. I wrote about auto-generated primary keys in some detail in a previous article. Occasionally, gaps in these primary key sequences can occur – which might come as a surprise to you.

This article shows the causes of sequence gaps, demonstrates the unexpected fact that sequences can even jump backwards, and gives an example of how to build a gapless sequence.

Gaps in sequences caused by rollback

We are used to the atomic behavior of database transactions: when PostgreSQL rolls a transaction back, all its effects are undone. As the documentation tells us, that is not the case for sequence values:

To avoid blocking concurrent transactions that obtain numbers from the same sequence, a nextval operation is never rolled back; that is, once a value has been fetched it is considered used and will not be returned again. This is true even if the surrounding transaction later aborts, or if the calling query ends up not using the value. For example an INSERT with an ON CONFLICT clause will compute the to-be-inserted tuple, including doing any required nextval calls, before detecting any conflict that would cause it to follow the ON CONFLICT rule instead. Such cases will leave unused “holes” in the sequence of assigned values.

This little example shows how a gap forms in a sequence:

CREATE TABLE be_positive (
   id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
   value integer CHECK (value > 0)
);

-- the identity column is backed by a sequence:
SELECT pg_get_serial_sequence('be_positive', 'id');

   pg_get_serial_sequence   
════════════════════════════
 laurenz.be_positive_id_seq
(1 row)

INSERT INTO be_positive (value) VALUES (42);
INSERT 0 1

INSERT INTO be_positive (value) VALUES (-99);
ERROR:  new row for relation "be_positive" violates
        check constraint "be_positive_value_check"
DETAIL:  Failing row contains (2, -99).

INSERT INTO be_positive (value) VALUES (314);
INSERT 0 1

TABLE be_positive;

 id │ value 
════╪═══════
  1 │    42
  3 │   314
(2 rows)

The second statement was rolled back, but the sequence value 2 was not, leaving a gap.

This intentional behavior is necessary for good performance. After all, a sequence should not be the bottleneck for a workload consisting of many INSERTs, so it has to perform well. Rolling back sequence values would reduce concurrency and complicate processing.

Gaps in sequences caused by caching

Even though nextval is cheap, a sequence could still be a bottleneck in a highly concurrent workload. To work around that, you can define a sequence with a CACHE clause greater than 1. Then the first call to nextval in a database session will actually fetch that many sequence values in a single operation. Subsequent calls to nextval use those cached values, and there is no need to access the sequence.

As a consequence, these cached sequence values get lost when the database session ends, leading to gaps:

CREATE SEQUENCE seq CACHE 20;

SELECT nextval('seq');

 nextval 
═════════
       1
(1 row)

SELECT nextval('seq');

 nextval 
═════════
       2
(1 row)

Now end the database session and start a new one:

SELECT nextval('seq');

 nextval 
═════════
      21
(1 row)

Gaps in sequences caused by a crash

As with all other objects, changes to sequences are logged to WAL, so that recovery can restore the state from a backup or after a crash. Since writing WAL impacts performance, not each call to nextval will log to WAL. Rather, the first call logs a value 32 numbers ahead of the current value, and the next 32 calls to nextval don’t log anything. That means that after recovering from a crash, the sequence may have skipped some values.

To demonstrate, I’ll use a little PL/Python function that crashes the server by sending a KILL signal to the current process:

CREATE FUNCTION seppuku() RETURNS void
   LANGUAGE plpython3u AS
'import os, signal
os.kill(os.getpid(), signal.SIGKILL)';

Now let’s see this in action:

CREATE SEQUENCE seq;

SELECT nextval('seq');

 nextval 
═════════
       1
(1 row)

SELECT seppuku();
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

Upon reconnect, we find that some values are missing:

SELECT nextval('seq');

 nextval 
═════════
      34
(1 row)

Sequences that jump backwards after a crash

It is a little-known fact that sequences can also jump backwards. A backwards jump can happen if the WAL record that logs the advancement of the sequence value has not yet been persisted to disk. Why? Because the transaction that contained the call to nextval has not yet committed:

CREATE SEQUENCE seq;

BEGIN;

SELECT nextval('seq');

 nextval 
═════════
       1
(1 row)

SELECT nextval('seq');

 nextval 
═════════
       2
(1 row)

SELECT nextval('seq');

 nextval 
═════════
       3
(1 row)

SELECT seppuku();
psql:seq.sql:9: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

Now reconnect and fetch the next sequence value:

SELECT nextval('seq');

 nextval 
═════════
       1
(1 row)

This looks scary, but no damage can happen to the database: since the transaction didn’t commit, it was rolled back, along with all possible data modifications that used the “lost” sequence values.

However, that leads to an interesting conclusion: don’t use sequence values from an uncommitted transaction outside that transaction.

How to build a gapless sequence

First off: think twice before you decide to build a gapless sequence. It will serialize all transactions that use that “sequence”. That will deteriorate your data modification performance considerably.

You almost never need a gapless sequence. Usually, it is good enough if you know the order of the rows, for example from the current timestamp at the time the row was inserted. Then you can use the row_number window function to calculate the gapless ordering while you query the data:

SELECT created_ts,
       value,
       row_number() OVER (ORDER BY created_ts) AS gapless_seq
FROM mytable;

You can implement a truly gapless sequence using a “singleton” table:

CREATE TABLE seq (id bigint NOT NULL);

INSERT INTO seq (id) VALUES (0);

CREATE FUNCTION next_val() RETURNS bigint
   LANGUAGE sql AS
'UPDATE seq SET id = id + 1 RETURNING id';

It is important not to create an index on the table, so that you can get HOT updates – and so that the table does not get bloated.
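
If you want to verify that the updates really are HOT, one way (a sketch) is to look at the statistics view:

SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'seq';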

Calling the next_val function will lock the table row until the end of the transaction, so keep all transactions that use it short.
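
As a usage sketch, with a hypothetical invoice table that needs gapless numbers:

CREATE TABLE invoice (
   invoice_number bigint PRIMARY KEY,
   amount numeric NOT NULL
);

BEGIN;
INSERT INTO invoice (invoice_number, amount)
   VALUES (next_val(), 42.00);
COMMIT;  -- the row lock on "seq" is released here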

Conclusion

I’ve shown you several different ways to make a sequence skip values — sometimes even backwards. But that is never a problem, if all you need are unique primary key values.

Resist the temptation to try for a “gapless sequence”. You can get it, but the performance impact is high.

If you are interested in learning about advanced techniques to enforce integrity, check out our blogpost on constraints over multiple rows.

The post Gaps in sequences in PostgreSQL appeared first on CYBERTEC.

Paul Brebner: Kafka Connect and Elasticsearch vs. PostgreSQL Pipelines: Final Performance Results (Pipeline Series Part 9)


In Part 6 and Part 7 of the pipeline series we took a different path in the pipe/tunnel and explored PostgreSQL and Apache Superset, mainly from a functional perspective—how can you get JSON data into PostgreSQL from Kafka Connect, and what does it look like in Superset. In Part 8, we ran some initial load tests and found out how the capacity of the original Elasticsearch pipeline compared with the PostgreSQL variant. These results were surprising (PostgreSQL 41,000 inserts/s vs. Elasticsearch 1,800 inserts/s), so in true “MythBusters” style we had another attempt to make them more comparable. 

Explosions were common on the TV show “MythBusters”!
(Source: Shutterstock)

1. Apples-to-Oranges Comparison

Next, I tried the classic “Myth Busters” second attempt methodology: just make it work! For this approach I discarded the Kafka Connect Elasticsearch sink connector and used an Elasticsearch client (The OpenSearch Python Client) directly to generate the load into Elasticsearch (okay, this is really more like an “apples-to-the-weirdest-fruit-you-can-think-of” comparison at this point). This client supports the security plugin, but I did have to modify it to send my example JSON tidal data, and to use the Bulk API to index multiple documents at a time.

The results were definitely better this time around:

  1. First, using the non-Bulk API, the maximum capacity was 5,000 index operations a second (with 60 client processes, and 50% Elasticsearch data nodes CPU utilization).
  2. Second, using the Bulk API the maximum capacity increased to a more comparable 37,500 index operations a second (50 client processes, 80% Elasticsearch data nodes CPU, 16% master nodes CPU)

Here’s what the Elasticsearch results look like. It’s apparent that using the Bulk API makes the biggest difference, probably because the HTTP protocol is much slower than the index operation. 

Elasticsearch Index sink connector

The results of the “just make it work” apples-to-oranges attempt between the PostgreSQL and Elasticsearch pipelines are now very comparable (well within a typical 10% experimental error margin), at around 40,000 inserts/s:

Elasticsearch vs PostgreSQL pipeline

Now, 40,000 inserts/s translates to an impressive 3.4 billion inserts per day, way more than we would need for our example NOAA tidal data pipeline. Doing some back-of-the-envelope calculations, we have a maximum of 500 NOAA tidal stations, with 10 metrics each, refreshing every 6 minutes, so we only need 14 inserts/s. However, if the scope of our pipeline increases, say to all the NOAA land-based weather stations (30,000, and there are even more stations in the ocean and air, etc.), which have on average of say 10 metrics, and say a refresh rate every 10 seconds, then the load is a more demanding 30,500 inserts/s. Both of the prototype pipeline sink systems would cope with this load (although I haven’t taken into account the extra load on the Kafka connect cluster, due to the Kafka source connectors). This is a good example of “scalability creep”—it’s always a good idea to plan for future growth in advance.

2. Comparing Elasticsearch and PostgreSQL Replication Settings

Given my relative lack of knowledge about PostgreSQL (which is okay, as we have lots of PostgreSQL experts available due to the recent acquisition of Credativ), I wondered whether the replication settings for PostgreSQL were really comparable with Elasticsearch.

Elasticsearch uses synchronous replication—each index operation succeeds only after writing to the primary, and (concurrently) every replica also acknowledges it. As far as I can tell, this only happens after the write request is both written to Lucene (into memory) and to the disk log. This achieves both high durability and consistency, and the data can be immediately read from the replicas.  

Here’s what I found out about PostgreSQL replication.

First, to confirm if synchronous replication is enabled (it is enabled on the Instaclustr managed PostgreSQL service, as long as you request more than 1 node at creation time), use a PostgreSQL client (e.g. psql), connect to the cluster, and run:

  • SHOW synchronous_commit
    • this will be on, and 
  • SHOW synchronous_standby_names
    • which will show 1 or more node names,
    • depending on how many nodes you specified at cluster creation – e.g. if 2 nodes, then there will be 1 name, if 3 nodes, then there will be 2 names

Second, “on” is the default commit mode, but there are more possibilities. The available options differ in whether the master alone or the replicas as well acknowledge the commit, and in the durability and consistency they provide. Here’s my summary table:

synchronous_commit   Commit ack when                          Master durability   Replica durability   Replica consistency   Lag (increasing)
off                  master only acks (in memory)             F                   F                    F                     0
local                master only flush (to disk)              T                   F                    F                     1
remote_write         replicas ack (in memory)                 T                   F                    F                     2
on (default)         replicas flush (to disk)                 T                   T                    F                     3
remote_apply         replicas applied (available for reads)   T                   T                    T                     4

The off and local modes provide no synchronous replication guarantee (the replicas are not waited for), whereas the rest do. All modes except off provide master durability (in the case of server failure, the data is persisted), but only on and remote_apply provide replica durability. Finally, remote_apply is the only option that also provides replica read consistency, as the data is made available for reads before the ack is sent. Note that for remote_write the data will eventually be written to disk (assuming the server hasn’t crashed – 200ms is the default time for flushes), and for the lower modes the data will still be made available for reads on the replicas, it is just done asynchronously.

Each successive option takes longer to acknowledge and will therefore, in theory, reduce the throughput of the system. This also seems to be the case in practice, as shown in this graph (results normalized to the best-case off mode):

The remote_apply mode is indeed the worst case for performance, with exactly half of the throughput of the fastest off mode. So, why is this relevant for the pipeline comparison? Well, the PostgreSQL results were obtained using the default on mode. However, a more directly comparable mode, in terms of durability and consistency, is the remote_apply mode. In practice, though, the performance difference between remote_apply and on is small (only about 4%), so the results are good enough for a rough comparison.

To get these results, I also had to find out how to set the scope of the settings correctly. This is because synchronous_commit can be set at the scope of transactions, sessions, users, databases, and instances (here’s a good blog to read) so I set it at the user level as follows:

ALTER USER username SET synchronous_commit=remote_apply;
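
For completeness, the other scopes follow the usual PostgreSQL patterns (a sketch; mydb is a placeholder database name and these are not what I ran for the benchmark):

BEGIN;
SET LOCAL synchronous_commit = remote_apply;  -- per transaction
COMMIT;

SET synchronous_commit = remote_apply;        -- per session

ALTER DATABASE mydb SET synchronous_commit = remote_apply;   -- per database

ALTER SYSTEM SET synchronous_commit = remote_apply;          -- instance-wide
SELECT pg_reload_conf();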

Finally, I was curious to see what the lags were for my running system. The best way I could find to see this was using this command:

select * from pg_stat_replication;

This returns information including write_lag (corresponding to remote_write delay), flush_lag (remote_flush delay), and replay_lag (remote_apply delay), along with the sync_state of the server (which is useful confirmation that it’s working as expected). See this documentation for an explanation of the metrics. This confirmed that the replication was keeping up (which was logical as the CPU utilization on the replica server was minimal). 

3. Result Caveats


Given the apples-to-dragon fruit (the oddest fruit that I actually like) nature of these results, here are a few caveats to take into account:

  • Replication for both Elasticsearch and PostgreSQL was “2” (i.e. 1 primary copy of data and 1 replica copy). The impact of increasing this to a recommended “3” was not measured.
  • The PostgreSQL synchronous_commit modes can have a big impact on throughput, and you should check if the default value of “on” meets your requirements for performance, durability, consistency, and availability (which I didn’t mention above, but there is another setting to control how many replicas must reply before a commit. With 1 replica these won’t have any impact, but with more replicas they may). 
  • Not being able to push the PostgreSQL master CPU higher than 50% was surprising, and suggests either hardware or software settings bottlenecks (this may also be related to the database connections).
  • The PostgreSQL results were obtained using our internal managed PostgreSQL preview service which had not yet been fully optimized for performance. The public preview (recently announced) and GA releases are likely to be more highly tuned. 
  • Elasticsearch scales horizontally with more nodes and shards, and PostgreSQL scales vertically with larger server sizes. I didn’t try further scaling for this comparison.
  • The comparison used different types of sink system clients, so it really isn’t on a level playing field:
    • The PostgreSQL client was a Kafka Connect sink connector but had potential issues with database pooling and leaks, so it’s unlikely that it had the optimal number of database connections
    • The Elasticsearch client was a customized Python client due to issues with the security plug-in and lack of Bulk API support in my Kafka sink connector 
  • In theory, PostgreSQL could benefit from an equivalent to the Elasticsearch Bulk API (e.g. multirow inserts perhaps, sketched right after this list). I didn't try it, and I don't expect it to have as much impact as the Elasticsearch Bulk API, given that the PostgreSQL message-based TCP/IP protocol is likely more efficient than HTTP
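
As a sketch of what such a multirow insert would look like (the tides table, its columns, and the values are hypothetical, just to illustrate the idea from the last bullet point):

-- one statement, several rows
INSERT INTO tides (ts, station_id, value) VALUES
  ('2021-10-01 00:00:00+00', 8724580, 0.43),
  ('2021-10-01 00:06:00+00', 8724580, 0.45),
  ('2021-10-01 00:12:00+00', 8724580, 0.47);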

However, the identical JSON example data was used for both alternatives, the cluster sizes and costs are very similar, and the results are, perhaps unexpectedly, very similar. Anyway, my goal wasn’t really to “race” them against each other, but rather to get a feel for the likely throughput and to confirm that they were both sensible choices for a pipeline. So far so good!

In the next and final blog of this series, we’ll sum up and evaluate the technologies from multiple perspectives.

Follow the Pipeline Series

  • Part 1: Building a Real-Time Tide Data Processing Pipeline Part 1
  • Part 2: Building a Real-Time Tide Data Processing Pipeline Part 2
  • Part 3: Getting to Know Apache Camel Kafka Connectors
  • Part 4: Monitoring Kafka Connect Pipeline Metrics with Prometheus
  • Part 5: Scaling Kafka Connect Streaming Data Processing
  • Part 6: Streaming JSON Data Into PostgreSQL Using Open Source Kafka Sink Connectors
  • Part 7: Using Apache Superset to Visualize PostgreSQL JSON Data
  • Part 8: Kafka Connect Elasticsearch vs. PostgreSQL Pipelines: Initial Performance Results
  • Part 9: Kafka Connect and Elasticsearch vs. PostgreSQL Pipelines: Final Performance Results

The post Kafka Connect and Elasticsearch vs. PostgreSQL Pipelines: Final Performance Results (Pipeline Series Part 9) appeared first on Instaclustr.

Andreas 'ads' Scherbaum: Ivan Panchenko

PostgreSQL Person of the Week Interview with Ivan Panchenko: My name is Ivan Panchenko, I live in Moscow, Russia. I am 49 now. Married, 5 children (already grown up).

Nikolay Samokhvalov: Useful queries to analyze PostgreSQL lock trees (a.k.a. lock queues)

For OLTP workloads (such as web and mobile applications), it is important to understand object-level and row-level locks in PostgreSQL. There are several good materials that I can recommend reading:

Jonathan Katz: Projecting Monthly Revenue Run Rate in Postgres


Monthly recurring revenue (MRR) and annual recurring revenue (ARR) are important metrics for a subscription-based business model. It allows for the business to project its available capital to make important decisions around expansion, hiring and more.

In an on-demand subscription model, MRR can fluctuate on any given day. This is especially true in the cloud-services world, where people are constantly spinning up and down services. That's why it makes sense to try and capture what the subscription revenue "run rate" is: this is a projection of how much money you are earning over a period of time based upon current service usage.

This exercise recently came up with a friend who was looking for a little help with analyzing and reporting on their SaaS business, which is running on Crunchy Bridge. Because sometimes SQL is just fun, I decided to roll up my sleeves and see what business insights we could get with a single query.

The twist here, compared to my past experience, is that this SaaS business uses more of a utility model where you can start/stop subscriptions, versus having some defined plan for each month (think metered billing, apparently something my colleague Craig said was a big topic for add-on marketplaces in his Heroku days). Before I started running Postgres on Kubernetes, I was an application developer / a SQL wonk of sorts who just so happened to work for a business with a subscription model. Thus, I was no stranger to MRR/ARR queries.

So let's dive in.

Michał Mackiewicz: Does QGIS work with Postgres 14?


Performing a major upgrade of a Postgres cluster poses very little risk to the database itself. The binary data structure is quite stable (and because of that, a very quick upgrade using pg_upgrade's --link option is possible), and it is highly unlikely that data can be damaged or lost during the process.

But there is another risk: the application making use of the data can stop working or lose some of its functionality. Why? Mostly because of changes in system catalogs and/or functions. For example, a catalog column can be dropped, or a function renamed. While PostGIS has a quite “soft” deprecation policy, and calling a deprecated function will result in a warning for a release or two, core Postgres is rather a “move fast and break things” guy. For QGIS specifically, the last serious problem was with upgrading from Postgres 11 to 12, which caused QGIS's DB Manager to stop working (https://github.com/qgis/QGIS/issues/32321). This was also the case for the ogr2ogr command-line utility.

Luckily, the Postgres 14 release doesn't contain any system catalog changes that affect QGIS, and both the “Add layer” and DB Manager interfaces work fine for QGIS versions at least as old as 3.14 “Pi”. So, if you run a spatial database with QGIS as the client app and are considering an upgrade: it's safe to proceed with the upgrade.

David Kohn: Function pipelines: Building functional programming into PostgreSQL using custom operators


Today, we are announcing function pipelines, a new capability that introduces functional programming concepts inside PostgreSQL (and SQL) using custom operators.

Function pipelines radically improve the developer ergonomics of analyzing data in PostgreSQL and SQL, by applying principles from functional programming and popular tools like Python’s Pandas and PromQL.

At Timescale our mission is to serve developers worldwide, and enable them to build exceptional data-driven products that measure everything that matters: e.g., software applications, industrial equipment, financial markets, blockchain activity, user actions, consumer behavior, machine learning models, climate change, and more.

We believe SQL is the best language for data analysis. We’ve championed the benefits of SQL for several years, even back when many were abandoning the language for custom domain-specific languages. And we were right - SQL has resurged and become the universal language for data analysis, and now many NoSQL databases are adding SQL interfaces to keep up.

But SQL is not perfect, and at times can get quite unwieldy. For example,

SELECT device_id, 
	sum(abs_delta) as volatility
FROM (
	SELECT device_id, 
		abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts))
        	as abs_delta 
	FROM measurements
	WHERE ts >= now() - '1 day'::interval) calc_delta
GROUP BY device_id; 

Pop quiz: What does this query do?

Even if you are a SQL expert, queries like this can be quite difficult to read - and even harder to express. Complex data analysis in SQL can be hard.

Function pipelines let you express that same query like this:

SELECT device_id, 
	timevector(ts, val) -> sort() -> delta() -> abs() -> sum() 
    		as volatility
FROM measurements
WHERE ts >= now() - '1 day'::interval
GROUP BY device_id;

Now it is much clearer what this query is doing. It:

  • Gets the last day’s data from the measurements table, grouped by device_id
  • Sorts the data by the time column
  • Calculates the delta (or change) between values
  • Takes the absolute value of the delta
  • And then takes the sum of the result of the previous steps

Function pipelines improve your own coding productivity, while also making your SQL code easier for others to comprehend and maintain.

Inspired by functional programming languages, function pipelines enable you to analyze data by composing multiple functions, leading to a simpler, cleaner way of expressing complex logic in PostgreSQL.

And the best part: we built function pipelines in a way that is fully PostgreSQL compliant - we did not change any SQL syntax - meaning that any tool that speaks PostgreSQL will be able to support data analysis using function pipelines.

How did we build this? By taking advantage of the incredible extensibility of PostgreSQL, in particular: custom types, custom operators, and custom functions.

In our previous example, you can see the key elements of function pipelines:

  • Custom data types: in this case, the timevector, which is a set of (time, value) pairs.
  • Custom operator: ->, used to compose and apply function pipeline elements to the data that comes in.
  • And finally, custom functions; called pipeline elements. Pipeline elements can transform and analyze timevectors (or other data types) in a function pipeline. For this initial release, we’ve built 60 custom functions! (Full list here).

We’ll go into more detail on function pipelines in the rest of this post, but if you just want to get started as soon as possible, the easiest way to try function pipelines is through a fully managed Timescale Cloud service. Try it for free (no credit card required) for 30 days.

Function pipelines are pre-loaded on each new database service on Timescale Cloud, available immediately - so after you’ve created a new service, you’re all set to use them!

If you prefer to manage your own database instances, you can install the timescaledb_toolkit into your existing PostgreSQL installation, completely for free.

We’ve been working on this capability for a long time, but in line with our belief of “move fast but don’t break things”, we’re initially releasing function pipelines as an experimental feature - and we would absolutely love to get your feedback. You can open an issue or join a discussion thread in GitHub (And, if you like what you see, GitHub ⭐ are always welcome and appreciated too!).

We’d also like to take this opportunity to give a huge shoutout to pgx, the Rust-based framework for building PostgreSQL extensions - it handles a lot of the heavy lifting for this project. We have over 600 custom types, operators, and functions in the timescaledb_toolkit extension at this point; managing this without pgx (and the ease of use that comes from working with Rust) would be a real bear of a job.

Function pipelines: why are they useful?

It’s October. In the Northern hemisphere (where most of Team Timescale sits, including your authors), it is starting to get cold.

Now imagine a restaurant in New York City whose owners care about their customers and their customers’ comfort. And you are working on an IoT product designed to help small businesses like these owners minimize their heating bill while maximizing their customers happiness. So you install two thermometers, one at the front measuring the temperature right by the door, and another at the back of the restaurant.

Now, as many of you may know (if you’ve ever had to sit by the door of a restaurant in the fall or winter), when someone enters, the temperature drops - and once the door is closed, the temperature warms back up. The temperature at the back of the restaurant will vary much less than at the front, right by the door. And both of them will drop slowly down to a lower set point during non-business hours and warm back up sometime before business hours based on the setpoints on our thermostat. So overall we’ll end up with a graph that looks something like this:

[Figure: temperature over time at the front (near the door) and at the back of the restaurant. The back is much steadier, while the front is more volatile. Graph is for illustrative purposes only, data is fabricated. No restaurants or restaurant patrons were harmed in the making of this post.]

As we can see, the temperature by the front door varies much more than at the back of the restaurant. Another way to say this is that the temperature by the front door is more volatile. Now, the owners of this restaurant want to measure this, because frequent temperature changes mean uncomfortable customers.

In order to measure volatility, we could first subtract the previous value from each value to calculate a delta. If we added these deltas up directly, large positive and negative deltas would cancel out. But we only care about the magnitude of the delta, not its sign - so what we really should do is take the absolute value of each delta and then sum those absolute values.

We now have a metric that might help us measure customer comfort, and also the efficacy of different weatherproofing methods (for example, adding one of those little vestibules that acts as a windbreak).

To track this, we collect measurements from our thermometers and store them in a table:

CREATE TABLE measurements(
	device_id BIGINT,
	ts TIMESTAMPTZ,
	val DOUBLE PRECISION
);

The device_id column identifies the thermostat, ts the time of the reading, and val the measured temperature.
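
If you want to follow along, a handful of fabricated readings is enough (the device numbers and values below are made up purely for illustration):

INSERT INTO measurements(device_id, ts, val) VALUES
	(1, now() - '3 hours'::interval, 20.5),  -- device 1: front, near the door
	(1, now() - '2 hours'::interval, 16.0),
	(1, now() - '1 hour'::interval,  21.0),
	(2, now() - '3 hours'::interval, 21.0),  -- device 2: back of the restaurant
	(2, now() - '2 hours'::interval, 21.2),
	(2, now() - '1 hour'::interval,  21.1);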

Using the data in our measurements table, let’s look at how we calculate volatility using function pipelines.

Note: because all of the function pipeline features are still experimental, they exist in the toolkit_experimental schema. Before running any of the SQL code in this post, you will need to set your search_path to include the experimental schema, as we do in the example below; we won't repeat this throughout the post so as not to distract.

set search_path to toolkit_experimental, public; --still experimental, so do this to make it easier to read

SELECT device_id, 
	timevector(ts, val) -> sort() -> delta() -> abs() -> sum() 
    	as volatility
FROM measurements
WHERE ts >= now()-'1 day'::interval
GROUP BY device_id;

And now we have the same query that we used as our example in the introduction.

In this query, the function pipeline timevector(ts, val) -> sort() -> delta() -> abs() -> sum() succinctly expresses the following operations:

  1. Create timevectors (more detail on this later) out of the ts and val columns
  2. Sort each timevector by the time column
  3. Calculate the delta (or change) between each pair in the timevector by subtracting the previous val from the current
  4. Take the absolute value of the delta
  5. Take the sum of the result from the previous steps

The FROM, WHERE and GROUP BY clauses do the rest of the work telling us:

  1. We’re getting data FROM the measurements table
  2. WHERE the ts, or timestamp column, contains values over the last day
  3. Showing one pipeline output per device_id (the GROUP BY column)

As we noted before, if you were to do this same calculation using SQL and PostgreSQL functionality, your query would look like this:

SELECT device_id, 
	sum(abs_delta) as volatility
FROM (
	SELECT 
		device_id,
		abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts) ) 
        	as abs_delta 
	FROM measurements
	WHERE ts >= now() - '1 day'::interval) calc_delta
GROUP BY device_id; 

This does the same 5 steps as the above, but it is much harder to understand: we have to use a window function and then aggregate its results, and because aggregates are evaluated before window functions within a single query level, the window function has to be executed in a subquery.

As we can see, function pipelines make it significantly easier to comprehend the overall analysis of our data. There’s no need to completely understand what’s going on in these functions just yet, but for now it’s enough to understand that we’ve essentially implemented a small functional programming language inside of PostgreSQL. You can still use all of the normal, expressive SQL you’ve come to know and love. Function pipelines just add new tools to your SQL toolbox that make it easier to work with time-series data.

Some avid SQL users might find the syntax a bit foreign at first, but for many people who work in other programming languages, especially using tools like Python’s Pandas Package, this type of successive operation on data sets will feel natural.

And again, this is still fully PostgreSQL compliant: We introduce no changes to the parser or anything that should break compatibility with PostgreSQL drivers.

How we built function pipelines without forking PostgreSQL

We built function pipelines - without modifying the parser or anything that would require a fork of PostgreSQL - by taking advantage of three of the many ways that PostgreSQL enables extensibility: custom types, custom functions, and custom operators.

  • Custom data types, starting with the timevector, which is a set of (time, value) pairs
  • A custom operator: ->, which is used to compose and apply function pipeline elements to the data that comes in.
  • Custom functions, called pipeline elements, which can transform and analyze timevectors (or other data types) in a function pipeline (with 60 functions in this initial release)

We believe that new idioms like these are exactly what PostgreSQL was meant to enable. That’s why it has supported custom types, functions and operators from its earliest days. (And is one of the many reasons why we love PostgreSQL.)

A custom data type: the timevector

A timevector is a collection of (time, value) pairs. As of now, the times must be TIMESTAMPTZs and the values must be DOUBLE PRECISION numbers. (But this may change in the future as we continue to develop this data type. If you have ideas/input, please file feature requests on Github explaining what you’d like!)

You can think of the timevector as something like this:

[Figure] A depiction of a timevector: a table of (time, value) pairs, with one column of timestamps and one column of floating point values.

One of the first questions you might ask is: how does a timevector relate to time-series data? (If you want to know more about time-series data, we have a great blog post on that).

Let’s consider our example from above, where we were talking about a restaurant that was measuring temperatures, and we had a measurements table like so:

CREATE TABLE measurements(
	device_id BIGINT,
	ts TIMESTAMPTZ,
	val DOUBLE PRECISION
);

In this example, we can think of a single time-series dataset as all historical and future time and temperature measurements from a device.

Given this definition, we can think of a timevector as a finite subset of a time-series dataset. The larger time-series dataset may extend back into the past and it may extend into the future, but the timevector is bounded.

[Figure] A timevector is a finite subset of a time-series: the time-series extends into the past and the future, while the timevector contains all the (time, value) pairs in some bounded region of it.

In order to construct a timevector from the data gathered from a thermometer, we use a custom aggregate and pass in the columns we want to become our (time, value) pairs. We can use the WHERE clause to define the extent of the timevector (i.e., the limits of this subset), and the GROUP BY clause to provide identifying information about the time-series that’s represented.

Building on our example, this is how we construct a timevector for each thermometer in our dataset:

SELECT device_id, 
	timevector(ts, val)
FROM measurements
WHERE ts >= now() - '1 day'::interval
GROUP BY device_id;

But a timevector doesn't provide much value by itself. So now, let's also consider some complex calculations that we can apply to the timevector, starting with the custom operator that is used to apply these functions.

A custom operator: ->


In function pipelines, the -> operator is used to apply and compose multiple functions, in an easy to write and read format.

Fundamentally, -> means: “apply the operation on the right to the inputs on the left”, or, more simply “do the next thing”.

We created a general-purpose operator for this because we think that too many operators meaning different things can get very confusing and difficult to read.

One thing that you’ll notice about the pipeline elements is that the arguments are in an unusual place in a statement like:

SELECT device_id, 
 	timevector(ts, val) -> sort() -> delta() -> abs() -> sum() 
    	as volatility
FROM measurements
WHERE ts >= now() - '1 day'::interval
GROUP BY device_id;

It appears (from the semantics) that the timevector(ts, val) is an argument to sort(), the resulting timevector is an argument to delta() and so on.

The thing is that sort() (and the others) are regular function calls; they can’t see anything outside of their parentheses and don’t know about anything to their left in the statement; so we need a way to get the timevector into the sort() (and the rest of the pipeline).

The way we solved this is by taking advantage of one of the same fundamental computing insights that functional programming languages use: code and data are really the same thing.

Each of our functions returns a special type that describes the function and its arguments. We call these types pipeline elements (more later).

The -> operator then performs one of two different types of actions depending on the types on its right and left sides.  It can either:

  1. Apply a pipeline element to the left hand argument - perform the function described by the pipeline element on the incoming data type directly.
  2. Compose pipeline elements into a combined element that can be applied at some point in the future (this is an optimization that allows us to apply multiple elements in a “nested” manner so that we don’t perform multiple unnecessary passes).

The operator determines the action to perform based on its left and right arguments.

Let's look at our pipeline from before: timevector(ts, val) -> sort() -> delta() -> abs() -> sum(). As I noted earlier, this function pipeline performs the following steps:

  1. Create timevectors out of the ts and val columns
  2. Sort it by the time column
  3. Calculate the delta (or change) between each pair in the timevector by subtracting the previous val from the current
  4. Take the absolute value of the delta
  5. Take the sum of the result from the previous steps

And logically, at each step, we can think of the timevector being materialized and passed to the next step in the pipeline.

However, while this will produce a correct result, it’s not the most efficient way to compute this. Instead, it would be more efficient to compute as much as possible in a single pass over the data.

In order to do this, we allow not only the apply operation, but also the compose operation. Once we’ve composed a pipeline into a logically equivalent higher order pipeline with all of the elements we can choose the most efficient way to execute it internally. (Importantly, even if we have to perform each step sequentially, we don’t need to materialize it and pass it between each step in the pipeline so it has significantly less overhead even without other optimization).

Custom functions: pipeline elements

Now let’s discuss the third, and final, key piece that makes up function pipelines: custom functions, or as we call them, pipeline elements.

We have implemented over 60 individual pipeline elements, which fall into 4 categories (with a few subcategories):

timevector transforms

These elements take in a timevector and produce a timevector. They are the easiest to compose, as they produce the same type.

Example pipeline:

SELECT device_id, 
	timevector(ts, val) 
    	-> sort() 
        -> delta() 
        -> map($$ ($value^3 + $value^2 + $value * 2) $$) 
        -> lttb(100) 
FROM measurements
GROUP BY device_id;

Organized by sub-category:

Unary mathematical

Simple mathematical functions applied to the value in each point in a timevector

Element   Description
abs()     Computes the absolute value of each value
cbrt()    Computes the cube root of each value
ceil()    Computes the first integer greater than or equal to each value
floor()   Computes the first integer less than or equal to each value
ln()      Computes the natural logarithm of each value
log10()   Computes the base 10 logarithm of each value
round()   Computes the closest integer to each value
sign()    Computes +/-1 for each positive/negative value
sqrt()    Computes the square root for each value
trunc()   Computes only the integer portion of each value

Binary mathematical

Simple mathematical functions with a scalar input applied to the value in each point in a timevector.

Element   Description
add(N)    Computes each value plus N
div(N)    Computes each value divided by N
logn(N)   Computes the logarithm base N of each value
mod(N)    Computes the remainder when each number is divided by N
mul(N)    Computes each value multiplied by N
power(N)  Computes each value taken to the N power
sub(N)    Computes each value less N
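
For instance, a minimal sketch that treats the readings as Celsius and converts them to Fahrenheit (the conversion is just an illustration, not part of the original example) could look like this:

SELECT device_id,
	timevector(ts, val) -> sort() -> mul(1.8) -> add(32.0)
FROM measurements
WHERE ts >= now() - '1 day'::interval
GROUP BY device_id;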

Compound transforms

Transforms involving multiple points inside of a timevector

Element                         Description
delta()                         Subtracts the previous value from each value
fill_to(interval, fill_method)  Fills gaps larger than interval with points at interval from the previous using fill_method
lttb(resolution)                Downsamples a timevector using the largest triangle three buckets algorithm at resolution; requires sorted input
sort()                          Sorts the timevector by the time column ascending

Lambda Elements

These elements use lambda expressions, which allow the user to write small functions to be evaluated over each point in a timevector.
Lambda expressions can return a DOUBLE PRECISION value like $$ $value^2 + $value + 3 $$. They can return a BOOL like $$ $time > '2020-01-01't $$. They can also return a (time, value) pair like $$ ($time + '1 day'i, sin($value) * 4) $$. You can apply them using the elements below:

Element                    Description
filter(lambda (bool))      Removes points from the timevector where the lambda expression evaluates to false
map(lambda (value))        Applies the lambda expression to all the values in the timevector
map(lambda (time, value))  Applies the lambda expression to all the times and values in the timevector
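
As a sketch combining the lambda forms shown above (the threshold and the unit conversion are made up for illustration), we could sum only the positive, i.e. warming, deltas and express them in Fahrenheit degrees:

SELECT device_id,
	timevector(ts, val)
    	-> sort()
        -> delta()
        -> filter($$ $value > 0.0 $$)  -- keep only the warming deltas
        -> map($$ $value * 1.8 $$)     -- convert the deltas from Celsius to Fahrenheit degrees
        -> sum()
FROM measurements
WHERE ts >= now() - '1 day'::interval
GROUP BY device_id;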

timevector finalizers

These elements end the timevector portion of a pipeline; they can either help with output or produce an aggregate over the entire timevector. They are an optimization barrier to composition, as they (usually) produce types other than timevector.

Example pipelines:

SELECT device_id, 
	timevector(ts, val) -> sort() -> delta() -> unnest()
FROM measurements
GROUP BY device_id;

SELECT device_id, 
	timevector(ts, val) -> sort() -> delta() -> time_weight()
FROM measurements
GROUP BY device_id;

Finalizer pipeline elements organized by sub-category:

timevector output

These elements help with output and can produce either a set of (time, value) pairs or a materialized timevector. Note: this is an area where we'd love further feedback - are there particular data formats, for example for graphing, that would be especially useful to add? File an issue in our GitHub!

Element        Description
unnest()       Produces a set of (time, value) pairs; you can wrap the pipeline and expand it as a composite type to produce separate columns, as in (pipeline -> unnest()).*
materialize()  Materializes a timevector to pass to an application or other operation directly; blocks any optimizations that would materialize it lazily

timevector aggregates

Aggregate all the points in a timevector to produce a single value as a result.

Element        Description
average()      Computes the average of the values in the timevector
counter_agg()  Computes the counter_agg aggregate over the times and values in the timevector
stats_agg()    Computes a range of statistical aggregates and returns a 1DStatsAgg over the values in the timevector
sum()          Computes the sum of the values in the timevector
num_vals()     Counts the points in the timevector

Aggregate accessors and mutators

These function pipeline elements act like the accessors that I described in our previous post on aggregates. You can use them to get a value from the aggregate part of a function pipeline like so:

SELECT device_id, 
	timevector(ts, val) -> sort() -> delta() -> stats_agg() -> variance() 
FROM measurements
GROUP BY device_id;

But these don’t just work on timevectors - they also work on a normally produced aggregate as well.

When used instead of normal function accessors and mutators they can make the syntax more clear by getting rid of nested functions like:

SELECT approx_percentile(0.5, percentile_agg(val)) 
FROM measurements

Instead, we can use the arrow accessor to convey the same thing:

SELECT percentile_agg(val) -> approx_percentile(0.5) 
FROM measurements

By aggregate family:

Counter Aggregates

Counter aggregates deal with resetting counters (and were stabilized in our 1.3 release this week!). Counters are a common type of metric in the application performance monitoring and metrics world. All values have resets accounted for. These elements must have a CounterSummary to their left when used in a pipeline, coming from a counter_agg() aggregate or pipeline element.

Element                         Description
counter_zero_time()             The time at which the counter value is predicted to have been zero, based on the least squares fit of the points input to the CounterSummary (the x intercept)
corr()                          The correlation coefficient of the least squares fit line of the adjusted counter value
delta()                         Computes the last - first value of the counter
extrapolated_delta(method)      Computes the delta extrapolated using the provided method to the bounds of the range; bounds must have been provided in the aggregate or a with_bounds call
idelta_left() / idelta_right()  Computes the instantaneous difference between the second and first points (left) or last and next-to-last points (right)
intercept()                     The y-intercept of the least squares fit line of the adjusted counter value
irate_left() / irate_right()    Computes the instantaneous rate of change between the second and first points (left) or last and next-to-last points (right)
num_changes()                   Number of times the counter changed values
num_elements()                  Number of items; any with the exact same time will have been counted only once
num_resets()                    Number of times the counter reset
slope()                         The slope of the least squares fit line of the adjusted counter value
with_bounds(range)              Applies bounds using the range (a TSTZRANGE) to the CounterSummary if they weren't provided in the aggregation step
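
As a sketch of how these accessors read: if we (hypothetically) treated val as a cumulative counter, such as a request count, the increase over the last day, with resets accounted for, could be computed like this:

SELECT device_id,
	counter_agg(ts, val) -> delta() AS increase_last_day
FROM measurements
WHERE ts >= now() - '1 day'::interval
GROUP BY device_id;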

Percentile Approximation

These aggregate accessors deal with percentile approximation. For now we’ve only implemented them for percentile_agg and uddsketch based aggregates. We have not yet implemented them for tdigest.

Element                    Description
approx_percentile(p)       The approximate value at percentile p
approx_percentile_rank(v)  The approximate percentile a value v would fall in
error()                    The maximum relative error guaranteed by the approximation
mean()                     The exact average of the input values
num_vals()                 The number of input values
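
For example, a sketch that asks for the approximate 95th percentile and the approximate rank of a (hypothetical) 25.0 degree reading:

SELECT device_id,
	percentile_agg(val) -> approx_percentile(0.95)      AS p95,
	percentile_agg(val) -> approx_percentile_rank(25.0) AS rank_of_25
FROM measurements
GROUP BY device_id;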

Statistical aggregates

These aggregate accessors add support for common statistical aggregates (and were stabilized in our 1.3 release this week!). They allow you to compute and rollup() common statistical aggregates like average and stddev, more advanced ones like skewness, as well as 2-dimensional aggregates like slope and covariance. Because there are both 1D and 2D versions of these, the accessors can have multiple forms; for instance, average() calculates the average on a 1D aggregate, while average_y() & average_x() do so on each dimension of a 2D aggregate.

Element                                                      Description
average() / average_y() / average_x()                        The average of the values
corr()                                                       The correlation coefficient of the least squares fit line
covariance(method)                                           The covariance of the values using either population or sample method
determination_coeff()                                        The determination coefficient (aka R squared) of the values
kurtosis(method) / kurtosis_y(method) / kurtosis_x(method)   The kurtosis (4th moment) of the values using either population or sample method
intercept()                                                  The intercept of the least squares fit line
num_vals()                                                   The number of (non-null) values seen
sum() / sum_x() / sum_y()                                    The sum of the values seen
skewness(method) / skewness_y(method) / skewness_x(method)   The skewness (3rd moment) of the values using either population or sample method
slope()                                                      The slope of the least squares fit line
stddev(method) / stddev_y(method) / stddev_x(method)         The standard deviation of the values using either population or sample method
variance(method) / variance_y(method) / variance_x(method)   The variance of the values using either population or sample method
x_intercept()                                                The x intercept of the least squares fit line
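
A minimal sketch using the 1D form (passing the method as the text 'sample' is an assumption based on the population/sample option in the table above):

SELECT device_id,
	stats_agg(val) -> average()         AS avg_temp,
	stats_agg(val) -> stddev('sample')  AS stddev_temp
FROM measurements
WHERE ts >= now() - '1 day'::interval
GROUP BY device_id;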

Time Weighted Averages

The average() accessor may be called on the output of a time_weight() like so:

SELECT time_weight('Linear', ts, val) -> average()  FROM measurements;

Approximate Count Distinct (Hyperloglog)

This is an approximation for distinct counts that was stabilized in our 1.3 release! The distinct_count() accessor may be called on the output of a hyperloglog() like so:

SELECT hyperloglog(device_id) -> distinct_count() FROM measurements;

Next steps

We hope this post helped you understand how function pipelines leverage PostgreSQL extensibility to offer functional programming concepts in a way that is fully PostgreSQL compliant, and how function pipelines can improve the ergonomics of your code, making it easier to write, read, and maintain.

You can try function pipelines today with a fully-managed Timescale Cloud service (no credit card required, free for 30 days). Function pipelines are available now on every new database service on Timescale Cloud, so after you’ve created a new service, you’re all set to use them!

If you prefer to manage your own database instances, you can download and install the timescaledb_toolkit extension on GitHub for free, after which you’ll be able to use function pipelines.

We love building in public. You can view our upcoming roadmap on GitHub for a list of proposed features, as well as features we’re currently implementing and those that are available to use today. We also welcome feedback from the community (it helps us prioritize the features users really want). To contribute feedback, comment on an open issue or in a discussion thread in GitHub.

And if you want to hear more about function pipelines or meet some of the folks who helped build it, be sure to join us for our first Timescale Community Day on October 28, 2021!


Laurenz Albe: WITH HOLD cursors and transactions in PostgreSQL


Alice and the curser (fortunately not WITH HOLD)
© Laurenz Albe 2021

Both cursors and transactions are basic ingredients for developing a database application. This article describes how cursors and transactions interact and how WITH HOLD can work around their limitations. We will also see the dangers involved and how to properly use WITH HOLD cursors in a PL/pgSQL procedure.

Cursors in PostgreSQL

When a query is ready for execution, PostgreSQL creates a portal from which the result rows can be fetched. During normal query execution, you receive the whole result set in one step. In contrast, a cursor allows you to fetch the result rows one by one. A cursor marks a position within a result set. Cursors are particularly useful in procedural code on the client or in the database, because they allow you to loop through the query results. Another advantage is that a cursor allows you to have more than one SQL statement running at the same time, which is normally not possible in a single database session.

A simple example for PL/pgSQL code that uses a cursor would be:

DO LANGUAGE plpgsql
$$DECLARE
   /* declare and open a cursor */
   c CURSOR FOR SELECT table_schema,
                       table_name
                FROM information_schema.tables
                WHERE table_schema = 'mydata'
                  AND table_name LIKE 'old\_%';
   v_schema text;
   v_name text;
BEGIN
   LOOP
      /* get next result row */
      FETCH c INTO v_schema, v_name;

      /* system variable FOUND is set by FETCH */
      EXIT WHEN NOT FOUND;

      /* avoid SQL injection */
      EXECUTE format(
                 'DROP TABLE %I.%I',
                 v_schema,
                 v_name
              );
   END LOOP;

   /* not necessary */
   CLOSE c;
END;$$;

In this example, the SELECT is executed concurrently with the DROP TABLE statements.

The above is not the most readable way to write this in PL/pgSQL (you could have used “FOR v_schema, v_name IN SELECT ... LOOP ... END LOOP;”, which uses a cursor “under the hood”), but I wanted to make the cursor explicit.

Note that it is often possible to avoid a cursor loop by using a join in the database. Such a join is more efficient, because it does all the work in a single statement. However, we have to use a cursor in our case, since we need to execute a dynamic SQL statement inside the loop.
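
As an illustration with hypothetical tables parent and child, a single join-based statement can replace a row-by-row cursor loop:

/* instead of fetching parent rows with a cursor and issuing
   one UPDATE per row, let a single statement do all the work */
UPDATE child AS c
   SET flagged = TRUE
  FROM parent AS p
 WHERE p.id = c.parent_id
   AND p.needs_review;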

Cursors and transactions

One basic property of a PostgreSQL cursor is that it only exists for the duration of a database transaction. That is not surprising, since a cursor is a single SQL statement, and an SQL statement is always part of one transaction. In the above example we had no problem, because a DO statement is always executed in a single transaction anyway.

Cursors are automatically closed at the end of a transaction, so it is usually not necessary to explicitly close them, unless they are part of a long-running transaction and you want to free the resources allocated by the statement.

Cursors in SQL

A special feature of PostgreSQL is that you can use cursors in SQL. You create a cursor with the DECLARE statement:

DECLARE name [ BINARY ] [ ASENSITIVE | INSENSITIVE ] [ [ NO ] SCROLL ]
    CURSOR [ { WITH | WITHOUT } HOLD ] FOR query

Here is a short description of the different options:

  • BINARY will fetch the results in the internal binary format, which may be useful if you want to read bytea columns and avoid the overhead of escaping them as strings
  • SCROLL means that you can move the cursor position backwards to fetch the same rows several times
  • WITH HOLD creates a cursor that is not automatically closed at the end of a transaction
  • ASENSITIVE and INSENSITIVE are redundant in PostgreSQL and are there for SQL standard compatibility

There is also an SQL statement FETCH that is more powerful than its PL/pgSQL equivalent, in that it can fetch more than one row at a time. Like PL/pgSQL, SQL also has a MOVE statement that moves the cursor position without retrieving rows.

SQL cursors are closed with the CLOSE statement, or by the end of the transaction.

SCROLL cursors

Some execution plans, like a B-tree index scan or a sequential scan, can be executed in both directions. A cursor for a query with such an execution plan is implicitly scrollable, that is, you can move the cursor position backwards in the result set. PostgreSQL calculates query result rows “on demand” and streams them to the client, so scrollable cursors for such queries come with no extra cost.

Other, more complicated execution plans require the explicit keyword SCROLL for the cursor to become scrollable. Such cursors incur an overhead, because the server must cache the entire result set.

Here is a little example that showcases scrollable cursors:

BEGIN;

/* this cursor would be implicitly scrollable */
DECLARE c SCROLL CURSOR
   FOR SELECT * FROM generate_series(1, 10);

FETCH 5 FROM c;

 generate_series 
═════════════════
               1
               2
               3
               4
               5
(5 rows)

MOVE BACKWARD 2 FROM c;

FETCH BACKWARD 2 FROM c;

 generate_series 
═════════════════
               2
               1
(2 rows)

/* sixth result row */
FETCH ABSOLUTE 6 FROM c;

 generate_series 
═════════════════
               6
(1 row)

FETCH ALL FROM c;

 generate_series 
═════════════════
               7
               8
               9
              10
(4 rows)

COMMIT;

Cursor sensitivity

The SQL standard distinguishes SENSITIVE, INSENSITIVE and ASENSITIVE cursors. A sensitive cursor reflects modifications of the underlying data; one consequence of this is that scrolling back to a previous row might fetch a different result. PostgreSQL does not implement sensitive cursors: that would be difficult, because a statement always sees a stable snapshot of the data in PostgreSQL.

PostgreSQL cursors are always insensitive, which means that changes in the underlying data after the cursor has started processing are not visible in the data fetched from the cursor. “Asensitive”, which means that the sensitivity is implementation dependent, is the same as “insensitive” in PostgreSQL.

Note that this insensitivity also applies if you modify a table via the special statements “UPDATE/DELETE ... WHERE CURRENT OF <cursor>”.
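
A minimal sketch with a throwaway table illustrates this insensitivity:

BEGIN;

CREATE TABLE scratch AS SELECT i FROM generate_series(1, 3) AS i;

DECLARE c CURSOR FOR SELECT i FROM scratch;

/* delete the rows after the cursor has been declared */
DELETE FROM scratch;

/* still returns 1, 2 and 3, because the cursor's snapshot predates the DELETE */
FETCH ALL FROM c;

COMMIT;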

WITH HOLD cursors

Since WITH HOLD cursors live longer than a transaction, but statements don’t, PostgreSQL must calculate the complete result set at COMMIT time and cache it on the server. This can result in COMMIT taking an unusually long time.

Moreover, WITH HOLD cursors are not automatically closed at the end of the transaction, so you must not forget to CLOSE them if you don’t want the result set to hog server resources until the end of the database session.

Here is an example of a WITH HOLD cursor in action:

BEGIN;

DECLARE c CURSOR WITH HOLD
   FOR SELECT i FROM generate_series(1, 10) AS i;

FETCH 3 FROM c;

 i 
═══
 1
 2
 3
(3 rows)

COMMIT;

FETCH 3 FROM c;

 i 
═══
 4
 5
 6
(3 rows)

/* important */
CLOSE c;

Cursors in PL/pgSQL

Cursors in PL/pgSQL are variables of the special data type refcursor. The value of such a variable is actually a string. That string is the name of the portal that is opened when a query is bound to the cursor variable and the cursor is opened.

Using refcursor variables, you can also pass cursors between PL/pgSQL functions or procedures:

CREATE FUNCTION c_open(n integer) RETURNS refcursor
   LANGUAGE plpgsql AS
$$DECLARE
   /* a query is bound to the cursor variable */
   c CURSOR (x integer) FOR SELECT * FROM generate_series(1, x);
BEGIN
   /* the cursor is opened */
   OPEN c(n);

   RETURN c;
END;$$;

CREATE FUNCTION c_fetch(cur refcursor) RETURNS TABLE (r integer)
   LANGUAGE plpgsql AS
$$BEGIN
   LOOP
      FETCH cur INTO r;

      EXIT WHEN NOT FOUND;

      RETURN NEXT;
   END LOOP;
END;$$;

SELECT c_fetch(c_open(5));

 c_fetch 
═════════
       1
       2
       3
       4
       5
(5 rows)

Cursor declarations in PL/pgSQL support SCROLL, but not WITH HOLD, for the historical reason that PostgreSQL functions always run inside a single transaction. Also, you can only FETCH a single row at a time from a PL/pgSQL cursor.

WITH HOLD cursors in PL/pgSQL procedures

Procedures, introduced in PostgreSQL v11, support transaction commands like COMMIT and ROLLBACK under certain circumstances. Consequently, it would be useful to have WITH HOLD cursors in procedures. There are two ways to work around the lack of WITH HOLD cursors in PL/pgSQL:

  • create the cursor in SQL and pass it as a refcursor argument to the procedure (a sketch of this approach follows right after this list)
  • use dynamic SQL to declare an SQL cursor
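
A minimal sketch of the first technique could look like this (the procedure and cursor names are made up for illustration, error handling is omitted for brevity, and the CALL must not run inside an explicit transaction block so that COMMIT works inside the procedure):

/* the procedure only receives the cursor name and loops over it */
CREATE PROCEDURE drop_from_cursor(c refcursor) LANGUAGE plpgsql AS
$$DECLARE
   v_schema text;
   v_name text;
BEGIN
   LOOP
      FETCH c INTO v_schema, v_name;

      EXIT WHEN NOT FOUND;

      /* avoid SQL injection */
      EXECUTE format('DROP TABLE %I.%I', v_schema, v_name);

      /* allowed, because the WITH HOLD cursor survives the COMMIT */
      COMMIT;
   END LOOP;

   CLOSE c;
END;$$;

/* the caller creates the WITH HOLD cursor in SQL (auto-committed as its own
   transaction) and passes its name to the procedure */
DECLARE curs CURSOR WITH HOLD FOR
   SELECT table_schema, table_name
   FROM information_schema.tables
   WHERE table_schema = 'mydata'
     AND table_name LIKE 'old\_%';

CALL drop_from_cursor('curs');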

Here is sample code that illustrates the second technique:

CREATE PROCEDURE del_old() LANGUAGE plpgsql AS
$$DECLARE
   /* assign the portal name */
   c refcursor := 'curs';
   v_schema text;
   v_name text;
BEGIN
   /* dynamic SQL to create the cursor */
   EXECUTE $_$DECLARE curs CURSOR WITH HOLD FOR
              SELECT table_schema,
                     table_name
              FROM information_schema.tables
              WHERE table_schema = 'mydata'
                AND table_name LIKE 'old\_%'$_$;
   LOOP
      FETCH c INTO v_schema, v_name;

      EXIT WHEN NOT FOUND;

      /*
       * We need to make sure that the cursor is closed
       * in the case of an error.  For that, we need an
       * extra block, because COMMIT cannot be used in
       * a block with an EXCEPTION clause.
       */
      BEGIN
         /* avoid SQL injection */
         EXECUTE format(
                    'DROP TABLE %I.%I',
                    v_schema,
                    v_name
                 );
      EXCEPTION
         WHEN OTHERS THEN
            CLOSE c;
            RAISE;
         WHEN query_canceled THEN
            CLOSE c;
            RAISE;
      END;

      /* reduce deadlock risk when dropping many tables */
      COMMIT;
   END LOOP;

   /* we need to close the cursor */
   CLOSE c;
END;$$;

Note how the code makes dead sure that the cursor cannot “leak” from the procedure!

Conclusion

Both cursors and transactions are well-known database features. Normally, cursors exist only within a single database transaction. But by using WITH HOLD, you can escape that limitation. Useful as this feature is, you have to be aware of the performance impact during COMMIT, and you have to make sure that you close the cursor to free the server’s resources.

The post WITH HOLD cursors and transactions in PostgreSQL appeared first on CYBERTEC.

Dave Page: 5 Cloud Architecture Considerations That Every Enterprise Should Know

Running databases in the cloud offers the promise of more efficient operations and greater agility. However, there are important architectural and management decisions that often people don’t consider. [Continue reading...]

Franck Pachot: Boosts Secondary Index Queries with Index Only Scan


Originally published at https://blog.yugabyte.com with YugabyteDB in mind, because the benefit of Index Only Scan is huge in a distributed database. It also applies to PostgreSQL, knowing that the hops to the table pages are saved only for the vacuumed part of the table.

A distributed SQL database reads from remote nodes, which increases the need for optimal data access. From my 20 years of experience as a database consultant, here is the most overlooked optimization technique for any database that supports it: performance-critical queries should find their columns from the index structure, without making extra hops to the table.

Many myths against SQL, such as "joins don't scale" or "analytic queries need a separate column store", have their source in bad index design. Contrary to a commonly used analogy, a database index is not like the index in a book. When you search for a keyword at the end of a book, an index entry will probably send you to one main page, or maybe two or three additional ones if you are not lucky. Only the words with small occurrences make their way to the book index, as it would be inefficient to list hundreds of pages for the most popular words. That's different in a database, where we don't skip any value.

With the exception of partial indexes, all table values from the indexed columns have their index entries. In SQL, you are allowed to query a range of values for which you will retrieve hundreds or thousands of rows. The index-to-table pages method (“Index Scan”), which is good enough for a few rows, is inefficient for a large range. But when the only alternative is a full scan (“Seq Scan”) on a big table, there’s no good access path offered to the query planner.

Fortunately, the query planner can come with a solution (“Index Only Scan”), but you need to provide the right index definition. In this blog post, we’ll explore how “Index Only Scan” can boost secondary index queries in a distributed SQL database.

Index Scan: a real-life example

Here is the classical CUSTOMER – ORDERS schema for a company selling coffee capsules, online and in local shops:

create table customers (
 customer_id     bigint constraint cust_pk primary key,
 customer_name   text   constraint cust_uk unique
);
create table orders (
 order_id bigint primary key,
 customer_id bigint references customers,
 order_date date,
 order_amount decimal
);

For welcoming loyal customers coming to the shop, the cashier — as soon as the customer identification is scanned (I’ll use the customer_name here for simplicity, even if we know it is not the best fit for a unique index) — gets an indication of the amount bought in the past 10 years. Here is the query:

select sum(ord.order_amount)
from customers cus join orders ord on cus.customer_id=ord.customer_id
where cus.customer_name='George Clooney'
and order_date > now() - interval '10 years';

The predicate on “customer_name” will use the index on it (implicitly created by the unique constraint here) and that’s efficient as it retrieves only one row. Obviously you don’t want a full table scan on ORDERS, as this is about one customer out of millions, and you have created an index on the foreign key:

create index ord_cust_id on orders(customer_id);

This is what most people do, and is often generated by a data modeling tool or JPA auto-ddl, from the foreign key definition. Developers think it is the right index because it is used by the query. Here is the explain(verbose) of it:

 Aggregate
   Output: sum(ord.order_amount)
   ->  Nested Loop
         Output: ord.order_amount
         ->  Index Scan using cust_uk on public.customers cus
               Output: cus.customer_id, cus.customer_name
               Index Cond: (cus.customer_name = 'George Clooney'::text)
         ->  Index Scan using ord_cust_id on public.orders ord
               Output: ord.order_id, ord.customer_id, ord.order_date, ord.order_amount
               Index Cond: (ord.customer_id = cus.customer_id)
               Filter: (ord.order_date > (now() - '10 years'::interval))

When this is slow, developers may try to remove the join, thinking the slowness holds there. But the join is not a problem: a one-to-many join is a single iteration in the nested loop outer table. The problem is in the details of the inner table access. You see an index access, and that’s good. But don’t stop there. Is this index used efficiently?

Moving to Index Only Scan

Now, let’s think about how data is stored. This company has millions of customers. They order capsules regularly, but not frequently. Let’s say this loyal customer places an order every month. That means you’ll have to fetch 120 rows. That doesn’t seem like a lot, but think about where the rows are stored. They are scattered within the whole table because they arrived through those 10 years, interleaved with a load of millions of other orders.

If the rows were cached in memory, that would be fast. But storing 10 years of order items in RAM, just in case one of those customers comes up to the shop, would not be cost efficient. Those rows will be read from disks. And in a distributed SQL database, they may have to be fetched from multiple nodes.

This is a case — with shops in many countries — where geo-distribution makes sense. The latency adds up, and even if each hop takes only milliseconds, hundreds of them bring the response time over a second. From a business perspective, we would like this information to come fast for the most loyal customers, but the best customers are also those with a lot of orders.

You may think that you need to replicate this information to another database service dedicated to this analytic purpose. But this adds complexity, more code, and additional costs. In a relational database, data redundancy for performance purposes is a built-in automated feature provided by indexes. Luckily, all you need to do is define the right index for this. The solution is simple and has many names: "covering index", "include index", "projection index", "fat index" and even "Tapio index", after Tapio Lahdenmäki, the author of "Relational Database Index Design and the Optimizers", who explained this in detail.

The SQL syntax is pretty simple:

create index ord_cust_new on orders(customer_id, order_date desc)
 include (order_amount);

That’s it. The columns have been added in two places in the CREATE INDEX statement. You add more columns to the index, and then you define which of them are used for access or for projection only. Here, “customer_id” and “order_date” are used to filter a range of rows based on the condition in the WHERE clause, so they are both required in the index key definition. As there’s a date range predicate on the date, it better fits in the index key than in the additional included columns. On the other hand, the “order_amount”, which is used only for the projection in the SELECT clause, doesn’t need to be part of the key, and including it out of the index key reduces the overhead on index maintenance.

Here is the subtle difference in the explain plan:

 Aggregate
   Output: sum(ord.order_amount)
   ->  Nested Loop
         Output: ord.order_amount
         ->  Index Scan using cust_uk on public.customers cus
               Output: cus.customer_id, cus.customer_name
               Index Cond: (cus.customer_name = 'George Clooney'::text)
         ->  Index Only Scan using ord_cust_new on public.orders ord
               Output: ord.customer_id, ord.order_date, ord.order_amount
               Index Cond: ((ord.customer_id = cus.customer_id) AND (ord.order_date > (now() - '10 years'::interval)))

As you can see, “Index Only Scan” has replaced “Index Scan”. This means that we have all information from the index, without having to fetch the many table rows that are scattered into the multiple nodes and disks. You see the whole WHERE clause predicate in “Index Condition”, and you know that all columns in “Output” were available from the index because of the “Index Only” access.

A larger index, but not a new one

It is important to note that we’ve created another index here, to show this tuning method, and that the query planner will be choosing it. But this new index should replace the other. As long as the previous columns stay first in the key column list, adding new columns adds more access paths, but still serves the previous ones. This means that in our example I can:

drop index ord_cust_id;

And a query on “customer_id” only will use the new index:

explain select count(*) from orders where customer_id=1;
                                      QUERY PLAN
---------------------------------------------------------------------
 Aggregate  (cost=15.50..15.51 rows=1 width=8)
   ->  Index Only Scan using ord_cust_new on orders  (cost=0.00..15.25 rows=100 width=0)
         Index Cond: (customer_id = 1)

Because we keep the same number of indexes, insert and delete operations are not impacted, and the same access paths remain available. This technique should not cause regressions on other queries as long as the leading indexed columns stay the same as before, especially when the additional columns are not updated frequently. The order of columns in the "include" list does not matter, so one fat index can cover multiple queries.

When to use Index Only Scan

Now, you may have heard some myths about fat indexes. One is about the overhead to maintain them when rows are inserted, deleted or updated. But you need to think about it in context. For INSERT and DELETE, you still have the same index maintenance because you need the index anyway. This one just has more columns. For UPDATE, the index maintenance occurs when you change the indexed column value. In our example, the ORDERS amount will not change once inserted, so this concern is irrelevant. We’re using this optimization because the value is not changed frequently and is queried for many rows.

Another myth is about the size of the index. Yes, it is larger. But, again, think in context. The few additional bytes per row on disk is not a problem here. The size matters when it comes to memory because memory on a node is limited. And, yes, those index entries take more space in the cache. But think about what happens with the minimal index I had before: each query was bringing into memory hundreds of table rows, with all their columns, for non-frequent usage. There’s definitely a huge improvement on cache usage with Index Only Scan.

Of course, you will not inflate all indexes to cover all your queries. First, you need to focus on the performance-critical ones where you want to keep single-digit millisecond response time. Then, this is not required when reading one or two rows. Fat indexes are for range scans that have to read hundreds or thousands of rows. And then, you still have the agility to decide which columns are worth adding to the index key or as an included column. Columns where selective filters are applied are the most important ones because, even if you have to go to the table, you will have less rows to read from it. This is why I’ve added the “order_date” in this example: no need to read orders older than 10 years.

In addition, rather than just listing it in “include”, I’ve added it to the index key as a range (asc/desc) so that this filtering is done by the index access. Another advantage is that rows are returned already sorted, so it can help for queries that display all orders rather than the sum. A good index should help for many queries.
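
For example, the same covering index can serve a query that lists a customer's most recent orders, already sorted by the index (the customer_id value here is just a placeholder):

select order_date, order_amount
from orders
where customer_id = 42
order by order_date desc
limit 10;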

Beyond the filtering, you want to avoid table access when reading a lot of rows from a table. This is where you include in the index all the columns used by the query. The latter is an additional optimization for performance critical queries. With a complex data model serving multiple use-cases, it is important to understand these nuances between minimal index and indexes covering all queries, as a balance between accelerating reads without penalizing updates.

This also means that “SELECT *” is almost always a bad idea. You should only select the columns that you will need in order to benefit from Index Only Scan. And this recommendation is not only for SELECT. An UPDATE that lists all the table columns will be inefficient. This is why you need to set dynamic-update=true in Hibernate, which is unfortunately not the default.

Covering functions in the index

So far, the index examples we have used are based on columns, but YugabyteDB — an open source, distributed SQL database for transactional applications — also supports functional or expression based indexes. Let’s say we have a query like this:

select sum(ord.order_amount)
from customers cus join orders ord on cus.customer_id=ord.customer_id
where cus.customer_name='George Clooney'
and extract(dow from order_date)=0 /* Sunday */;

We can define an index on (customer_id, (extract(dow from order_date))) to get fast access to those rows. To change it to a covering index, it is not sufficient to add include (order_amount) as I did above. YugabyteDB uses the PostgreSQL query planner, which doesn't analyze — at optimization time — which columns are used in the expression. So, if you want the expression to be covered, you need to add all columns that are used by the function. Here is the index that covers the query:

create index ord_cust_new
 on orders(customer_id, (extract(dow from order_date)))
 include (order_amount, order_date);

The execution plan shows an Index Only Scan with this index:

      ->  Index Only Scan using ord_cust_new on orders ord
            Index Cond: ((customer_id = cus.customer_id) AND ((date_part('dow'::text, (order_date)::timestamp without time zone)) = '0'::double precision))

Without the standalone order_date added, an Index Scan would have been executed even if not required to get the result.

But what about primary indexes?

I explicitly mentioned secondary indexes in the title, but what about the primary key? The good thing is that, in YugabyteDB, you don’t care. An “Index Scan” on the primary key finds all columns without any additional work because the whole table is stored in the LSM-tree structure. Do not expect an “Index Only Scan” from the primary index, because it is already optimized by the DocDB storage. This is different for PostgreSQL, or Oracle heap tables, where all indexes are secondary indexes and always need an extra hop to the table if they are not covering all query columns. YugabyteDB is primarily optimized for OLTP with primary key access. The secondary index optimization discussed here allows the additional analytic queries — which are part of many OLTP applications — for running efficiently within the same OLTP database.

Further reading

Haiying Tang: What is the new LZ4 TOAST compression in PostgreSQL 14, and how fast is it?


The new column compression option LZ4 introduced in PostgreSQL 14 provides faster compression. How can I use it, and how much faster is it? 

Álvaro Hernández: StackGres 1.0.0: Open Source Postgres-aaS with 120+ Extensions


StackGres 1.0.0: Open Source Postgres-aaS with 120+ Extensions

TL;DR

We’re announcing StackGres 1.0.0 GA. StackGres is an Open Source Postgres-as-a-Service that runs on any Kubernetes environment. StackGres, today, is the Postgres platform with the most Postgres extensions available: 120 as of today. Many more to come in the future.

Why Kubernetes?

At OnGres we have been providing Postgres Professional Services for years. We love IaC (Infrastructure as Code). One of our realizations was that it was next to impossible to create a “single IaC package” that would contain Postgres and its Stack of components, that would work on any environment. There are just so many different compute, storage and networking components, with different APIs.

So we asked ourselves: is there any “abstraction layer” for distributed infrastructure? The answer was Kubernetes. Kubernetes APIs allowed us to create a single deployable Postgres package, that contains Postgres itself and all its necessary Stack, and run –almost– anywhere.

StackGres, a feature-rich Postgres Platform

StackGres contains the full Stack of Postgres components needed for production:

  • High availability and automated failover, using Patroni. Managed DNS endpoints for the primary (rw) and replicas (ro, load balanced) connections.

  • Built-in, enabled by default connection-pooling.

  • Configurable automated backups with retention policies.

  • Prometheus metrics exporter, with auto-binding. Customized Postgres metrics, Grafana dashboards and Postgres alerts.

  • Expertly tuned default configurations –which you can optionally override.

StackGres: innovating the Postgres Platform

So far, the features mentioned may be "table stakes" as of today. We wanted to provide more. StackGres has introduced significant innovations - while keeping the same Postgres core boring - such as:

  • Using Envoy to proxy all Postgres traffic. In collaboration with the Envoy Community, we developed the Postgres filter for Envoy. StackGres uses it to add additional monitoring (by inspecting the wire protocol) and to terminate SSL!

  • A fully-featured Web Console. It supports everything StackGres can do. And yes, it has a dark mode ;)


  • A system to push all Postgres and Patroni logs from all pods to a central server. This central server is a separate, StackGres-managed Postgres database with Timescale to support large volume and high velocity logging. Query your logs with SQL or from the Web Console!

  • Fully automated “Day 2” DBA operations, including: minor and major version upgrades, controlled cluster restart, container version upgrades, vacuum, repack and even benchmarks!

  • A system to dynamically load Postgres extensions into the containers. This allows us to ship lighter containers, while supporting potentially hundreds of extensions –120 as of now, with many more to come in the future.

Ease of installation and use

Production Postgres requires a fair amount of Postgres expertise. One of our goals is to “democratize production Postgres”, and make it accessible to almost anyone. StackGres is designed to be easy to operate.

Visit StackGres Installation page for a quick primer on how to install StackGres either via kubectl or Helm (Helm Chart also published to ArtifactHub, please star it!). There is more detailed information in the Production Installation section of the documentation.

For this post we will use Helm installation. Add the StackGres Helm repository:

helm repo add stackgres-charts https://stackgres.io/downloads/stackgres-k8s/stackgres/helm/

And install with Helm into a dedicated namespace:

helm install --namespace stackgres --create-namespace \
    stackgres stackgres-charts/stackgres-operator

The above command will normally take 0-2 minutes. Now let's create our first cluster. If you prefer to do it all from the Web Console, skip to the next section. Create a simple YAML file named demo.yaml with the following content:

apiVersion: stackgres.io/v1
kind: SGCluster
metadata:
  name: sgdemo
spec:
  postgres:
    version: 'latest'
  instances: 1
  pods:
    persistentVolume:
      size: '5Gi'

Then kubectl apply the file:

kubectl apply -f demo.yaml

The easiest way to connect is to run psql within the postgres-util container, an administration container that is always started with your StackGres pods:

kubectl exec -it sgdemo-0 -c postgres-util -- psql

You may also connect from any other pod within the cluster. Note that the default username is postgres and the password is randomly generated and stored in a secret named after the cluster, under the key superuser-password.

Adding an extension!

StackGres supports as of today more than 120 extensions, with many more to come in the future.

As mentioned, StackGres loads extensions dynamically. This means that, save for a few that are required for internal operation, your cluster will by default contain no other extensions. You can check by running the select * from pg_available_extensions() query from within psql. This is good! Containers are lighter and are exposed to fewer potential security problems.

But now let’s say you want to use your favorite extension, for example citext. This is a very useful extension to deal with text in a case insensitive manner. To add this extension to the live container, you just need to edit the .spec.postgres section of the SGCluster as in:

...
spec:
  postgres:
    version: 'latest'
    extensions:
      - name: 'citext'
...

(you can either edit the original YAML file and then run kubectl apply -f demo.yaml, or edit the resource directly with kubectl edit sgcluster sgdemo)

In a few seconds, the extension will be available! You may also do this from the Web Console by editing the cluster and selecting the extension(s) that you want to load. Now, to use the extension, you just need to run in psql, as with any Postgres extension:

create extension citext;
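
To see the case-insensitive behavior in action, a quick check with a hypothetical table could look like this:

create table users(email citext primary key);
insert into users values ('Alice@Example.com');
-- finds the row despite the different case
select * from users where email = 'alice@example.com';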

To release the resources, simply run kubectl delete -f demo.yaml.

Using the Web Console

As mentioned, StackGres comes with a fully-featured Web Console. Let’s check it out. Let’s identify the pod name where the Web Console is running and do a port-forward to expose it in our laptop:

WEBC=`kubectl --namespace stackgres get pods -l "app=stackgres-restapi" -o jsonpath="{.items[0].metadata.name}"`
kubectl --namespace stackgres port-forward "$WEBC" 8443:9443

For production usage, it is recommended to use a LoadBalancer as part of the installation Helm parameters or setup an Ingress controller. Check the Installation via Helm section of the documentation for more information.

Open in your web browser the address https://localhost:8443/. You will see a warning (expected, the certificate is self-signed, you may also bring your own with custom Helm parameters). The default username is admin. The password is generated randomly and can be queried with the following command:

kubectl -n stackgres get secret stackgres-restapi --template '{{ printf "%s\n" (.data.clearPassword | base64decode) }}'

Select a namespace to work on (you may use default, or create one with kubectl create namespace demo). On the left pane hover over StackGres Clusters and click on the (+) icon to create a new cluster. You may create a simple cluster as in the following example:

Web Console: create simple StackGres cluster

Once the cluster is created, you may click on the name to see the details about the cluster. Check the View Connection Info link at the top right of the cluster information to see quick commands on how to connect to your new cluster.

Next steps: tutorial

Want to go deeper? Check our tutorial, which explores the main StackGres features in a little bit more detail.

Please drop us a note and let us know what you think in our Slack and/or Discord channels. Also write us there if you find any bug or want to become a collaborator.
