
Ryan Booz: What is ClickHouse, how does it compare to PostgreSQL and TimescaleDB, and how does it perform for time-series data?


Over the past year, one database we keep hearing about is ClickHouse, a column-oriented OLAP database initially built and open-sourced by Yandex.

In this detailed post, which is the culmination of 3 months of research and analysis, we answer the most common questions we hear, including:

  • What is ClickHouse (including a deep dive of its architecture)
  • How does ClickHouse compare to PostgreSQL
  • How does ClickHouse compare to TimescaleDB
  • How does ClickHouse perform for time-series data vs. TimescaleDB

Want to learn more? Join @ryanbooz and @jonatasdp on Wednesday, October 27, 2021 @ 10 AM ET (4 PM CET) to see how we set up our testing environment and configured the servers, and to watch us run a test cycle on both ClickHouse and TimescaleDB!

📺 twitch.tv/timescaledb

👉 youtube.com/timescaledb

Shout out to Timescale engineers Alexander Kuzmenkov, most recently a core developer on ClickHouse, and Aleksander Alekseev, also a PostgreSQL contributor, who helped check our work and keep us honest with this post.

Benchmarking, not “Benchmarketing”

At Timescale, we take our benchmarks very seriously. We find that in our industry there is far too much vendor-biased “benchmarketing” and not enough honest “benchmarking.” We believe developers deserve better. So we take great pains to really understand the technologies we are comparing against - and also to point out places where the other technology shines (and where TimescaleDB may fall short).

You can see this in our other detailed benchmarks vs. AWS Timestream (29 minute read), MongoDB (19 minute read), and InfluxDB (26 minute read).

We’re also database nerds at heart who really enjoy learning about and digging into other systems. (Which are a few reasons why these posts - including this one - are so long!)

So to better understand the strengths and weaknesses of ClickHouse, we spent the last three months and hundreds of hours benchmarking, testing, reading documentation, and working with contributors.

How ClickHouse fared in our tests

ClickHouse is a very impressive piece of technology. In some tests, ClickHouse proved to be a blazing fast database, able to ingest data faster than anything else we’ve tested so far (including TimescaleDB). In some complex queries, particularly those that do complex grouping aggregations, ClickHouse is hard to beat.

But nothing in databases comes for free. ClickHouse achieves these results because its developers have made specific architectural decisions. These architectural decisions also introduce limitations, especially when compared to PostgreSQL and TimescaleDB.

ClickHouse’s limitations / weaknesses include:

  • Worse query performance than TimescaleDB at nearly all queries in the Time-Series Benchmark Suite, except for complex aggregations.
  • Poor insert performance and much higher disk usage (e.g., 2.7x higher disk usage than TimescaleDB) at small batch sizes (e.g., 100-300 rows/batch).
  • Non-standard SQL-like query language with several limitations (e.g., joins are discouraged, syntax is at times non-standard)
  • Lack of other features one would expect in a robust SQL database (e.g., PostgreSQL or TimescaleDB): no transactions, no correlated sub-queries, no stored procedures, no user defined functions, no index management beyond primary and secondary indexes, no triggers.
  • Inability to modify or delete data at a high rate and low latency - instead, deletes and updates must be issued in batches
  • Batch deletes and updates happen asynchronously
  • Because data modification is asynchronous, ensuring consistent backups is difficult: the only way to ensure a consistent backup is to stop all writes to the database
  • Lack of transactions and lack of data consistency also affects other features like materialized views, because the server can't atomically update multiple tables at once. If something breaks during a multi-part insert to a table with materialized views, the end result is an inconsistent state of your data.

We list these shortcomings not because we think ClickHouse is a bad database. We actually think it’s a great database - well, to be more precise, a great database for certain workloads. And as a developer, you need to choose the right tool for your workload.

Why does ClickHouse fare well in certain cases, but worse in others?

The answer is the underlying architecture.

Generally in databases there are two types of fundamental architectures, each with strengths and weaknesses: OnLine Transactional Processing (OLTP) and OnLine Analytical Processing (OLAP).

OLTP | OLAP
Large and small datasets | Large datasets focused on reporting/analysis
Transactional data (the raw, individual records matter) | Pre-aggregated or transformed data to foster better reporting
Many users performing varied queries and updates on data across the system | Fewer users performing deep data analysis with few updates
SQL is the primary language for interaction | Often, but not always, utilizes a particular query language other than SQL

ClickHouse, PostgreSQL, and TimescaleDB architectures

At a high level, ClickHouse is an excellent OLAP database designed for systems of analysis.

PostgreSQL, by comparison, is a general-purpose database designed to be a versatile and reliable OLTP database for systems of record with high user engagement.

TimescaleDB is a relational database for time-series: purpose-built on PostgreSQL for time-series workloads. It combines the best of PostgreSQL plus new capabilities that increase performance, reduce cost, and provide an overall better developer experience for time-series.

So, if you find yourself needing to perform fast analytical queries on mostly immutable large datasets with few users, i.e., OLAP, ClickHouse may be the better choice.

Instead, if you find yourself needing something more versatile, that works well for powering applications with many users and likely frequent updates/deletes, i.e., OLTP, PostgreSQL may be the better choice.

And if your applications have time-series data - and especially if you also want the versatility of PostgreSQL - TimescaleDB is likely the best choice.

Time-series Benchmark Suite results summary (TimescaleDB vs. ClickHouse)

We can see the impact of these architectural decisions in how TimescaleDB and ClickHouse fare with time-series workloads.

We spent hundreds of hours working with ClickHouse and TimescaleDB during this benchmark research. We tested insert loads from 100 million rows (1 billion metrics) to 1 billion rows (10 billion metrics), cardinalities from 100 to 10 million, and numerous combinations in between. We really wanted to understand how each database works across various datasets.

Overall, for inserts we find that ClickHouse outperforms on inserts with large batch sizes - but underperforms with smaller batch sizes. For queries, we find that ClickHouse underperforms on most queries in the benchmark suite, except for complex aggregates.

Insert Performance

When rows are batched between 5,000 and 15,000 rows per insert, speeds are fast for both databases, with ClickHouse performing noticeably better:

Insert comparison between ClickHouse and TimescaleDB at cardinalities between 100 and 1 million hosts
Performance comparison: ClickHouse outperforms TimescaleDB at all cardinalities when batch sizes are 5,000 rows or greater

However, when the batch size is smaller, the results are reversed in two ways: insert speed and disk consumption. With larger batches of 5,000 rows/batch, ClickHouse consumed ~16GB of disk during the test, while TimescaleDB consumed ~19GB (both before compression).

With smaller batch sizes, not only does TimescaleDB maintain steady insert speeds that are faster than ClickHouse between 100-300 rows/batch, but disk usage is 2.7x higher with ClickHouse. This difference should be expected because of the architectural design choices of each database, but it's still interesting to see.

Insert comparison of TimescaleDB and ClickHouse with small batch sizes. TimescaleDB outperforms and uses 2.7x less disk space.
Performance comparison: Timescale outperforms ClickHouse with smaller batch sizes and uses 2.7x less disk space

Query performance

For testing query performance, we used a "standard" dataset that queries data for 4,000 hosts over a three-day period, with a total of 100 million rows. In our experience running benchmarks in the past, we found that this cardinality and row count works well as a representative dataset for benchmarking because it allows us to run many ingest and query cycles across each database in a few hours.

Based on ClickHouse’s reputation as a fast OLAP database, we expected ClickHouse to outperform TimescaleDB for nearly all queries in the benchmark.

When we ran TimescaleDB without compression, ClickHouse did outperform.

However, when we enabled TimescaleDB compression - which is the recommended approach - we found the opposite, with TimescaleDB outperforming nearly across the board:

Bar chart displaying results of query response between TimescaleDB and ClickHouse. TimescaleDB outperforms in almost every query category.
Results of query benchmarking between TimescaleDB and ClickHouse. TimescaleDB outperforms in almost every query category

(For those that want to replicate our findings or better understand why ClickHouse and TimescaleDB perform the way they do under different circumstances, please read the entire article for the full details.)

Cars vs. Bulldozers

Today we live in the golden age of databases: there are so many databases that all these lines (OLTP/OLAP/time-series/etc.) are blurring. Yet every database is architected differently, and as a result, has different advantages and disadvantages. As a developer, you should choose the right tool for the job.

After spending lots of time with ClickHouse, reading their docs, and working through weeks of benchmarks, we found ourselves repeating this simple analogy:

ClickHouse is like a bulldozer - very efficient and performant for a specific use-case. PostgreSQL (and TimescaleDB) is like a car: versatile, reliable, and useful in most situations you will face in your life.

Most of the time, a “car” will satisfy your needs. But if you find yourself doing a lot of “construction”, by all means, get a “bulldozer.”

We aren’t the only ones who feel this way. Here is a similar opinion shared on HackerNews by stingraycharles (whom we don’t know, but stingraycharles if you are reading this - we love your username):

"TimescaleDB has a great timeseries story, and an average data warehousing story; Clickhouse has a great data warehousing story, an average timeseries story, and a bit meh clustering story (YMMV)."

In the rest of this article, we do a deep dive into the ClickHouse architecture, and then highlight some of the advantages and disadvantages of ClickHouse, PostgreSQL, and TimescaleDB, that result from the architectural decisions that each of its developers (including us) have made. We conclude with a more detailed time-series benchmark analysis. We also have a detailed description of our testing environment to replicate these tests yourself and verify our results.

Yes, we’re the makers of TimescaleDB, so you may not trust our analysis. If so, we ask you to hold your skepticism for the next few minutes, and give the rest of this article a read. As you (hopefully) will see, we spent a lot of time in understanding ClickHouse for this comparison: first, to make sure we were conducting the benchmark the right way so that we were fair to Clickhouse; but also, because we are database nerds at heart and were genuinely curious to learn how ClickHouse was built.

Next steps

Are you curious about TimescaleDB? The easiest way to get started is by creating a free Timescale Cloud account, which will give you access to a fully-managed TimescaleDB instance (100% free for 30 days).

If you want to host TimescaleDB yourself, you can do it completely for free - visit our GitHub to learn more about options, get installation instructions, and more (⭐️  are very much appreciated! 🙏)

One last thing: you can join our Community Slack to ask questions, get advice, and connect with other developers (we are +7,000 and counting!). We, the authors of this post, are very active on all channels - as well as all our engineers, members of Team Timescale, and many passionate users.

What is ClickHouse?

ClickHouse, short for “Clickstream Data Warehouse”, is a columnar OLAP database that was initially built for web analytics in Yandex Metrica. Generally, ClickHouse is known for its high insert rates, fast analytical queries, and SQL-like dialect.

Timeline of ClickHouse development from 2008 to 2020

Timeline of ClickHouse development (Full history here.)

We are fans of ClickHouse. It is a very good database built around certain architectural decisions that make it a good option for OLAP-style analytical queries. In particular, in our benchmarking with the Time Series Benchmark Suite (TSBS), ClickHouse performed better for data ingestion than any time-series database we've tested so far (TimescaleDB included) at an average of more than 600k rows/second on a single instance, when rows are batched appropriately.

But nothing in databases comes for free - and as we’ll show below, this architecture also creates significant limitations for ClickHouse, making it slower for many types of time-series queries and some insert workloads. If your application doesn't fit within the architectural boundaries of ClickHouse (or TimescaleDB for that matter), you'll probably end up with a frustrating development experience, redoing a lot of work down the road.

The ClickHouse Architecture

ClickHouse was designed for OLAP workloads, which have specific characteristics. From the ClickHouse documentation, here are some of the requirements for this type of workload:

  • The vast majority of requests are for read access.
  • Data is inserted in fairly large batches (> 1000 rows), not by single rows; or it is not updated at all.
  • Data is added to the DB but is not modified.
  • For reads, quite a large number of rows are processed from the DB, but only a small subset of columns.
  • Tables are “wide,” meaning they contain a large number of columns.
  • Queries are relatively rare (usually hundreds of queries per server or less per second).
  • For simple queries, latencies around 50 ms are allowed.
  • Column values are fairly small: numbers and short strings (for example, 60 bytes per URL).
  • Requires high throughput when processing a single query (up to billions of rows per second per server).
  • Transactions are not necessary.
  • Low requirements for data consistency.
  • There is one large table per query. All tables are small, except for one.
  • A query result is significantly smaller than the source data. In other words, data is filtered or aggregated, so the result fits in a single server’s RAM.

How is ClickHouse designed for these workloads? Here are some of the key aspects of their architecture:

  • Compressed, column-oriented storage
  • Table Engines
  • Indexes
  • Vector Computation Engine

Compressed, column-oriented storage

First, ClickHouse (like nearly all OLAP databases) is column-oriented (or columnar), meaning that data for the same table column is stored together. (In contrast, in row-oriented storage, used by nearly all OLTP databases, data for the same table row is stored together.)

Column-oriented storage has a few advantages:

  • If your query only needs to read a few columns, then reading that data is much faster (you don’t need to read entire rows, just the columns)
  • Storing columns of the same data type together leads to greater compressibility (although, as we have shown, it is possible to build columnar compression into row-oriented storage).

Table engines

To improve the storage and processing of data in ClickHouse, columnar data storage is implemented using a collection of table "engines". The table engine determines the type of table and the features that will be available for processing the data stored inside.

ClickHouse primarily uses the MergeTree table engine as the basis for how data is written and combined. Nearly all other table engines derive from MergeTree, and allow additional functionality to be performed automatically as the data is (later) processed for long-term storage.

(Quick clarification: from this point forward, whenever we mention MergeTree, we're referring to the overall MergeTree architecture design and all table types that derive from it, unless we call out a specific MergeTree type.)

At a high level, MergeTree allows data to be written and stored very quickly to multiple immutable files (called "parts" by ClickHouse). These files are later processed in the background at some point in the future and merged into a larger part with the goal of reducing the total number of parts on disk (fewer files = more efficient data reads later). This is one of the key reasons behind ClickHouse’s astonishingly high insert performance on large batches.

All columns in a table are stored in separate parts (files), and all values in each column are stored in the order of the primary key. This column separation and sorting implementation make future data retrieval more efficient, particularly when computing aggregates on large ranges of contiguous data.
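As a purely illustrative sketch of what this looks like in practice, here is a minimal MergeTree table definition; the cpu_metrics name and columns are assumptions for illustration, not the benchmark schema:

-- Hypothetical ClickHouse MergeTree table for CPU metrics
CREATE TABLE cpu_metrics
(
    created_at DateTime,
    tags_id    UInt32,
    cpu_user   Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)   -- parts are grouped into monthly partitions
ORDER BY (tags_id, created_at);     -- data within each part is stored in this key order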

Indexes

Once the data is stored and merged into the most efficient set of parts for each column, queries need to know how to efficiently find the data. For this, ClickHouse relies on two types of indexes: the primary index and, additionally, a secondary (data skipping) index.

Unlike a traditional OLTP B-tree index, which knows how to locate any row in a table, the ClickHouse primary index is sparse, meaning that it does not have a pointer to the location of every value for the primary key. Instead, because all data is stored in primary key order, the primary index stores the value of the primary key every N-th row (N is called index_granularity, 8192 by default). This is done with the specific design goal of fitting the primary index into memory for extremely fast processing.
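Continuing the illustrative cpu_metrics sketch from above, the sparse primary index is defined by the table's ORDER BY key, and a secondary (data skipping) index can be added for another column:

-- Hypothetical: a minmax data-skipping index on cpu_user; it prunes whole
-- granules of index_granularity rows rather than locating individual rows
-- the way a B-tree would.
ALTER TABLE cpu_metrics
    ADD INDEX cpu_user_skip cpu_user TYPE minmax GRANULARITY 4;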

When your query patterns fit with this style of index, the sparse nature can help improve query speed significantly. The one limitation is that you cannot create other indexes on specific columns to help improve a different query pattern. We'll discuss this more later.

Vector Computation Engine

ClickHouse was designed with the desire to have "online" query processing in a way that other OLAP databases hadn't been able to achieve. Even with compression and columnar data storage, most other OLAP databases still rely on incremental processing to pre-compute aggregated data. It has generally been the pre-aggregated data that's provided the speed and reporting capabilities.

To overcome these limitations, ClickHouse implemented a series of vector algorithms for working with large arrays of data on a column-by-column basis. With vectorized computation, ClickHouse can work with data in blocks of tens of thousands of rows (per column) for many computations. Vectorized computing also provides an opportunity to write more efficient code that utilizes modern SIMD processors, and keeps code and data closer together for better memory access patterns, too.

In total, this is a great feature for working with large data sets and writing complex queries on a limited set of columns, and something TimescaleDB could benefit from as we explore more opportunities to utilize columnar data.

That said, as you'll see from the benchmark results, enabling compression in TimescaleDB (which converts data into compressed columnar storage), improves the query performance of many aggregate queries in ways that are even better than ClickHouse.

ClickHouse disadvantages because of its architecture (aka: nothing comes for free)

Nothing comes for free in database architectures. ClickHouse is clearly designed with a very specific workload in mind; by the same token, it is not designed for other types of workloads.

We can see an initial set of disadvantages from the ClickHouse docs:

  • No full-fledged transactions.
  • Lack of ability to modify or delete already inserted data with a high rate and low latency. There are batch deletes and updates available to clean up or modify data, for example, to comply with GDPR, but not for regular workloads.
  • The sparse index makes ClickHouse not so efficient for point queries retrieving single rows by their keys.

There are a few disadvantages that are worth going into detail:

  • Data can’t be directly modified in a table
  • Some “synchronous” actions aren’t really synchronous
  • SQL-like, but not quite SQL
  • No data consistency in backups

MergeTree Limitation: Data can’t be directly modified in a table

All tables in ClickHouse are immutable. There is no way to directly update or delete a value that's already been stored. Instead, any operations that UPDATE or DELETE data can only be accomplished through an `ALTER TABLE` statement that applies a filter and actually re-writes the entire table (part by part) in the background to update or delete the data in question. Essentially it's just another merge operation with some filters applied.
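For illustration, using the hypothetical cpu_metrics table from earlier, such "mutations" look like regular DML but are rewrites scheduled in the background (a sketch, not our benchmark workload):

-- ClickHouse mutations: asynchronous part rewrites, not in-place changes
ALTER TABLE cpu_metrics DELETE WHERE tags_id = 42;
ALTER TABLE cpu_metrics UPDATE cpu_user = 0 WHERE created_at < '2021-01-01 00:00:00';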

As a result, several specialized MergeTree table engines exist to work around this deficiency - to handle common scenarios where frequent data modifications would otherwise be necessary. Yet this can lead to unexpected behavior and non-standard queries.

As an example, if you need to store only the most recent reading of a value, creating a CollapsingMergeTree table type is your best option. With this table type, an additional column (called `Sign`) is added to the table which indicates which row is the current state of an item when all other field values match. ClickHouse will then asynchronously delete rows with a `Sign` that cancel each other out (a value of 1 vs -1), leaving the most recent state in the database.
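A minimal sketch of what such a table definition could look like (the column types and ordering key are assumptions):

CREATE TABLE SensorLastReading
(
    SensorID UInt32,
    Temp     Float64,
    Cpu      Float64,
    Sign     Int8
)
ENGINE = CollapsingMergeTree(Sign)
ORDER BY SensorID;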

For instance, consider a common database design pattern where the most recent values of a sensor are stored alongside the long-term time-series table for fast lookup. We'll call this table SensorLastReading. In ClickHouse, this table would require the following pattern to store the most recent value every time new information is stored in the database.

SensorLastReading

SensorID | Temp | Cpu | Sign
1        | 55   | 78  | 1

When new data is received, you need to add 2 more rows to the table, one to negate the old value, and one to replace it.

SensorID | Temp | Cpu | Sign
1        | 55   | 78  | 1
1        | 55   | 78  | -1
1        | 40   | 35  | 1
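In SQL terms, that "cancel and replace" write is a single insert of two rows (a sketch against the table definition above):

INSERT INTO SensorLastReading (SensorID, Temp, Cpu, Sign) VALUES
    (1, 55, 78, -1),   -- cancels the previously stored state
    (1, 40, 35,  1);   -- the new current state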

At some point after this insert, ClickHouse will merge the changes, removing the two rows that cancel each other out on Sign, leaving the table with just this row:

SensorID | Temp | Cpu | Sign
1        | 40   | 35  | 1

But remember, MergeTree operations are asynchronous and so queries can occur on data before something like the collapse operation has been performed. Therefore, the queries to get data out of a CollapsingMergeTree table require additional work, like multiplying rows by their `Sign`, to make sure you get the correct value any time the table is in a state that still contains duplicate data.

Here is one solution that the ClickHouse documentation provides, modified for our sample data. Notice that with numeric values, you can get the "correct" answer by multiplying each value by the Sign column and adding a HAVING clause.

SELECT
    SensorID,
    sum(Temp * Sign) AS Temp,
    sum(Cpu * Sign) AS Cpu
FROM SensorLastReading
GROUP BY SensorID
HAVING sum(Sign) > 0

Again, the value here is that MergeTree tables provide really fast ingestion of data at the expense of transactions and simple concepts like UPDATE and DELETE in the way traditional applications would try to use a table like this. With ClickHouse, it's just more work to manage this kind of data workflow.

Because ClickHouse isn't an ACID database, these background modifications (or really any data manipulations) have no guarantees of ever being completed. Because there is no such thing as transaction isolation, any SELECT query that touches data in the middle of an UPDATE or DELETE modification (or a Collapse modification as we noted above) will get whatever data is currently in each part. If the delete process, for instance, has only modified 50% of the parts for a column, queries would return outdated data from the remaining parts that have not yet been processed.

More importantly, this holds true for all data that is stored in ClickHouse, not just the large, analytics-focused tables that store something like time-series data, but also the related metadata. While it's understandable that time-series data, for example, is often insert-only (and rarely updated), business-centric metadata tables almost always see modifications and updates as time passes. Regardless, the related business data that you may store in ClickHouse to do complex joins and deeper analysis still lives in a MergeTree table (or a variation of MergeTree), and therefore any updates or deletes still require an entire rewrite (through the use of `ALTER TABLE`).

Distributed MergeTree tables

Distributed tables are another example of where asynchronous modifications might cause you to change how you query data. If your application writes data directly to the distributed table (rather than directly to the cluster nodes, which is possible for advanced users), the data is first written to the "initiator" node, which in turn copies the data to the shards in the background as quickly as possible. Because there are no transactions to verify that the data was moved as part of something like a two-phase commit (available in PostgreSQL), your data might not actually be where you think it is.
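For illustration, a Distributed table that fans writes out to shards through such an initiator looks roughly like this (a sketch; 'my_cluster' and the underlying cpu_metrics table are assumptions, not our benchmark setup):

-- Writes to cpu_metrics_dist are forwarded to cpu_metrics on each shard
CREATE TABLE cpu_metrics_dist AS cpu_metrics
ENGINE = Distributed(my_cluster, default, cpu_metrics, rand());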

There is at least one other problem with how distributed data is handled. Because ClickHouse does not support transactions and data is in a constant state of being moved, there is no guarantee of consistency in the state of the cluster nodes. Saving 100,000 rows of data to a distributed table doesn't guarantee that backups of all nodes will be consistent with one another (we'll discuss reliability in a bit). Some of that data might have been moved, and some of it might still be in transit.

Again, this is by design, so there's nothing specifically wrong with what's happening in ClickHouse! It's just something to be aware of when comparing ClickHouse to something like PostgreSQL and TimescaleDB.

Some “synchronous” actions aren’t really synchronous

Most actions in ClickHouse are not synchronous. But we found that even some of the ones labeled “synchronous” weren’t really synchronous either.

One particular example that caught us by surprise during our benchmarking was how `TRUNCATE` worked. We ran many test cycles against ClickHouse and TimescaleDB to identify how changes in row batch size, workers, and even cardinality impacted the performance of each database. At the end of each cycle, we would `TRUNCATE` the database in each server, expecting the disk space to be released quickly so that we could start the next test. In PostgreSQL (and other OLTP databases), this is an atomic action. As soon as the truncate is complete, the space is freed up on disk.

Dashboard graph showing disk usage and immediate release of space after using TRUNCATE
TRUNCATE is an atomic action in TimescaleDB/PostgreSQL and frees disk almost immediately

We expected the same thing with ClickHouse because the documentation mentions that this is a synchronous action (and most things are not synchronous in ClickHouse). It turns out, however, that the files only get marked for deletion and the disk space is freed up at a later, unspecified time in the background. There's no specific guarantee for when that might happen.

Dashboard graph showing disk usage of ClickHouse and the time needed to free disk space after TRUNCATE
TRUNCATE is an asynchronous action in ClickHouse, freeing disk at some future time

For our tests it was a minor inconvenience. We had to add a 10-minute sleep into the testing cycle to ensure that ClickHouse had released the disk space fully. In real-world situations, like ETL processing that utilizes staging tables, a `TRUNCATE` wouldn't actually free the staging table data immediately - which could cause you to modify your current processes.

We point out a few of these scenarios simply to highlight that ClickHouse isn't a drop-in replacement for many of the things a system of record (OLTP database) is generally used for in modern applications. Asynchronous data modification can take a lot more effort to work with effectively.

SQL-like, but not quite SQL

In many ways, ClickHouse was ahead of its time by choosing SQL as the language of choice.

ClickHouse chose early in its development to utilize SQL as the primary language for managing and querying data. Given the focus on data analytics, this was a smart and obvious choice given that SQL was already widely adopted and understood for querying data.

In ClickHouse, the SQL isn't something that was added after the fact to satisfy a portion of the user community. That said, what ClickHouse provides is a SQL-like language that doesn't comply with any actual standard.

A SQL-like query language brings many challenges. For example, users who will be accessing the database (or writing applications that access the database) need to be retrained. Another challenge is a lack of ecosystem: connectors and tools that speak SQL won't just work out of the box - i.e., they will require some modification (and, again, knowledge on the user's part) to work.

Overall, ClickHouse handles basic SQL queries well.

However, because the data is stored and processed in a different way from most SQL databases, there are a number of commands and functions you may expect to use from a SQL database (e.g., PostgreSQL, TimescaleDB), but which ClickHouse doesn't support or has limited support for:

  • Not optimized for JOINs
  • No index management beyond the primary and secondary indexes
  • No recursive CTEs
  • No correlated subqueries or LATERAL joins
  • No stored procedures
  • No user defined functions
  • No triggers

One example that stands out about ClickHouse is that JOINs, by nature, are generally discouraged because the query engine lacks any ability to optimize the join of two or more tables. Instead, users are encouraged to query table data with separate sub-select statements and then use something like an `ANY INNER JOIN`, which strictly looks for unique pairs on both sides of the join (avoiding the cartesian product that can occur with standard JOIN types). There's also no caching support for the product of a JOIN, so if a table is joined multiple times, the query on that table is executed multiple times, further slowing down the query.

For example, all of the "double-groupby" queries in TSBS group by multiple columns and then join to the tag table to get the `hostname` for the final output. Here is how that query is written for each database.

TimescaleDB:

WITH cpu_avg AS (
     SELECT time_bucket('1 hour', time) AS hour,
       tags_id,
       AVG(cpu_user) AS mean_cpu_user
     FROM cpu
     WHERE time >= '2021-01-01T12:30:00Z'
       AND time < '2021-01-02T12:30:00Z'
     GROUP BY 1, 2
)
SELECT hour, hostname, mean_cpu_user
FROM cpu_avg
JOIN tags ON cpu_avg.tags_id = tags.id
ORDER BY hour, hostname;

ClickHouse:

SELECT
    hour,
    id,
    mean_cpu_user
FROM
(
    SELECT
        toStartOfHour(created_at) AS hour,
        tags_id AS id,
        AVG(cpu_user) as mean_cpu_user
    FROM cpu
    WHERE (created_at >= '2021-01-01T12:30:00Z') 
        AND (created_at < '2021-01-02T12:30:00Z')
    GROUP BY
        hour,
        id
) AS cpu_avg
ANY INNER JOIN tags USING (id)
ORDER BY
    hour ASC,
    id;

Reliability: no data consistency in backups

One last aspect to consider as part of the ClickHouse architecture and its lack of support for transactions is that there is no data consistency in backups. As we've already shown, all data modification (even sharding across a cluster) is asynchronous, therefore the only way to ensure a consistent backup would be to stop all writes to the database and then make a backup. Data recovery struggles with the same limitation.

The lack of transactions and data consistency also affects other features like materialized views because the server can't atomically update multiple tables at once. If something breaks during a multi-part insert to a table with materialized views, the end result is an inconsistent state of your data.

ClickHouse is aware of these shortcomings and is certainly working on or planning updates for future releases. Some form of transaction support has been under discussion for some time, and backup functionality is in progress and merged into the main branch of code, although it is not yet recommended for production use. But even then, it will only provide limited support for transactions.

ClickHouse vs. PostgreSQL

(A proper ClickHouse vs. PostgreSQL comparison would probably take another 8,000 words. To avoid making this post even longer, we opted to provide a short comparison of the two databases - but if anyone wants to provide a more detailed comparison, we would love to read it.)

As we can see above, ClickHouse is a well-architected database for OLAP workloads. Conversely, PostgreSQL is a well-architected database for OLTP workloads.

Also, PostgreSQL isn’t just an OLTP database: it’s the fastest growing and most loved OLTP database (DB-Engines, StackOverflow 2021 Developer Survey).

As a result, we won’t compare the performance of ClickHouse vs. PostgreSQL because - to continue our analogy from before - it would be like comparing the performance of a bulldozer vs. a car. These are two different things designed for two different purposes.

We’ve already established why ClickHouse is excellent for analytical workloads. Let’s now understand why PostgreSQL is so loved for transactional workloads: versatility, extensibility, and reliability.

PostgreSQL versatility and extensibility

Versatility is one of the distinguishing strengths of PostgreSQL. It's one of the main reasons for the recent resurgence of PostgreSQL in the wider technical community.

PostgreSQL supports a variety of data types including arrays, JSON, and more. It supports a variety of index types - not just the common B-tree but also GIST, GIN, and more. Full text search? Check. Role-based access control? Check. And of course, full SQL.

Also, through the use of extensions, PostgreSQL can retain the things it's good at while adding specific functionality to enhance the ROI of your development efforts.

Does your application need geospatial data? Add the PostGIS extension. What about features that benefit time-series data workloads? Add TimescaleDB. Could your application benefit from the ability to search using trigrams? Add pg_trgm.
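Each of these is enabled the same way (a sketch; the extension names are the ones mentioned above, and timescaledb additionally needs to be listed in shared_preload_libraries):

-- Enable the extensions mentioned above
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE EXTENSION IF NOT EXISTS timescaledb;   -- also requires shared_preload_libraries = 'timescaledb'
CREATE EXTENSION IF NOT EXISTS pg_trgm;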

With all these capabilities, PostgreSQL is quite flexible - which means that it is essentially future-proof. As your application changes, or as your workloads change, you will know that you can still adapt PostgreSQL to your needs.

(For one specific example of the powerful extensibility of PostgreSQL, please read how our engineering team built functional programming into PostgreSQL using custom operators.)

PostgreSQL reliability

As developers, we're resigned to the fact that programs crash, servers encounter hardware or power failures, and disks fail or experience corruption. You can mitigate this risk (e.g., robust software engineering practices, uninterrupted power supplies, disk RAID, etc.), but not eliminate it completely; it's a fact of life for systems.

In response, databases are built with an array of mechanisms to further reduce such risk, including streaming replication to replicas, full-snapshot backup and recovery, streaming backups, robust data export tools, etc.

PostgreSQL has the benefit of 20+ years of development and usage, which has resulted in not just a reliable database, but also a broad spectrum of rigorously tested tools: streaming replication for high availability and read-only replicas, pg_dump and pg_restore for full database snapshots, pg_basebackup and log shipping / streaming for incremental backups and arbitrary point-in-time recovery, pgBackRest or WAL-E for continuous archiving to cloud storage, and robust COPY FROM and COPY TO tools for quickly importing/exporting data in a variety of formats. This enables PostgreSQL to offer greater "peace of mind" - because all of the skeletons in the closet have already been found (and addressed).

ClickHouse vs. TimescaleDB

TimescaleDB is the leading relational database for time-series, built on PostgreSQL. It offers everything PostgreSQL has to offer, plus a full time-series database.

As a result, all of the advantages for PostgreSQL also apply to TimescaleDB, including versatility and reliability.

But TimescaleDB adds some critical capabilities that allow it to outperform for time-series data:

  • Hypertables - The foundation for many TimescaleDB features (listed below), hypertables automatically partition data across time and space for more performant inserts and queries (see the sketch after this list)
  • Continuous aggregates - Intelligently updated materialized views for time-series data. Rather than recreating the materialized view every time, TimescaleDB updates the view based only on the underlying changes to raw data.
  • Columnar compression - Efficient data compression of 90%+ on most time-series data with dramatically improved query performance for historical, long+narrow queries.
  • Hyperfunctions - Analytic focused functions added to PostgreSQL to enhance time-series queries with features like approximate percentiles, efficient downsampling, and two-step aggregation.
  • Function pipelines (released this week!) - Radically improve the developer ergonomics of analyzing data in PostgreSQL and SQL, by applying principles from functional programming and popular tools like Python’s Pandas and PromQL.
  • Horizontal scale-out (multi-node) - Horizontal scaling of time-series data for both storage and distributed queries across multiple nodes.
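As a brief illustration of the first two items, here is a minimal sketch, assuming a plain cpu(time, tags_id, cpu_user, ...) table like the one used in the queries in this post:

-- Turn the table into a hypertable partitioned by time
SELECT create_hypertable('cpu', 'time', chunk_time_interval => INTERVAL '1 day');

-- A continuous aggregate that incrementally maintains hourly averages
CREATE MATERIALIZED VIEW cpu_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS hour,
       tags_id,
       avg(cpu_user) AS mean_cpu_user
FROM cpu
GROUP BY hour, tags_id;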

ClickHouse vs. TimescaleDB performance for time-series data

Time-series data has exploded in popularity because the value of tracking and analyzing how things change over time has become evident in every industry: DevOps and IT monitoring, industrial manufacturing, financial trading and risk management, sensor data, ad tech, application eventing, smart home systems, autonomous vehicles, professional sports, and more.

It's distinct from more traditional business-type (OLTP) data in at least two primary ways: it is primarily insert-heavy, and the scale of the data grows at an unceasing rate. This impacts both data collection and storage, as well as how we analyze the values themselves. Traditional OLTP databases often can't handle millions of transactions per second or provide effective means of storing and maintaining the data.

Time-series data also differs from general analytical (OLAP) data in that queries generally have a time component, and queries rarely touch every row in the database.

Over the last few years, however, the lines between the capabilities of OLTP and OLAP databases have started to blur. For the last decade, the storage challenge was mitigated by numerous NoSQL architectures, while still failing to effectively deal with the query and analytics required of time-series data.

As a result many applications try to find the right balance between the transactional capabilities of OLTP databases and the large-scale analytics provided by OLAP databases. It makes sense, therefore, that many applications would try to use ClickHouse, which offers fast ingest and analytical query capabilities, for time-series data.

So, let's see how both ClickHouse and TimescaleDB compare for time-series workloads using our standard TSBS benchmarks.

Performance Benchmarks

Let me start by saying that this wasn't a test we completed in a few hours and then moved on from. In fact, just yesterday, while finalizing this blog post, we installed the latest version of ClickHouse (released 3 days ago) and ran all of the tests again to ensure we had the best numbers possible! (benchmarking, not benchmarketing)

In preparation for the final set of tests, we ran benchmarks on both TimescaleDB and ClickHouse dozens of times each - at least. We tried different cardinalities, different lengths of time for the generated data, and various settings for things that we had easy control over - like "chunk_time_interval" with TimescaleDB. We wanted to really understand how each database would perform with typical cloud hardware and the specs that we often see in the wild.

We also acknowledge that most real-world applications don't work like the benchmark does: ingesting data first and querying it second. But separating each operation allows us to understand which settings impacted each database during different phases, which also allowed us to tweak benchmark settings for each database along the way to get the best performance.

Finally, we always view these benchmarking tests as an academic and self-reflective experience. That is, spending a few hundred hours working with both databases often causes us to consider ways we might improve TimescaleDB (in particular), and thoughtfully consider when we can - and should - say that another database solution is a good option for specific workloads.

Machine Configuration

For this benchmark, we made a conscious decision to use cloud-based hardware configurations that were reasonable for a medium-sized workload typical of startups and growing businesses. In previous benchmarks, we've used bigger machines with specialized RAID storage, which is a very typical setup for a production database environment.

But, as time has marched on and we see more developers use Kubernetes and modular infrastructure setups without lots of specialized storage and memory optimizations, it felt more genuine to benchmark each database on instances that more closely matched what we tend to see in the wild. Sure, we can always throw more hardware and resources to help spike numbers, but that often doesn't help convey what most real-world applications can expect.

To that end, for comparing both insert and read latency performance, we used the following setup in AWS:

  • Versions: TimescaleDB version 2.4.0, community edition, with PostgreSQL 13; ClickHouse version 21.6.5 (the latest non-beta releases for both databases at the time of testing).
  • 1 remote client machine running TSBS, 1 database server, both in the same cloud datacenter
  • Instance size: Both client and database server ran on Amazon EC2 virtual machines (m5.4xlarge) with 16 vCPU and 64GB Memory each.
  • OS: Both server and client machines ran Ubuntu 20.04.3
  • Disk Size: 1TB of EBS GP2 storage
  • Deployment method: Installed via apt-get using official sources

Database configuration

ClickHouse: No configuration modifications were made to ClickHouse. We simply installed it per the documentation. There is currently no tool like timescaledb-tune for ClickHouse.

TimescaleDB: For TimescaleDB, we followed the recommendations in the Timescale documentation. Specifically, we ran timescaledb-tune and accepted the configuration suggestions, which are based on the specifications of the EC2 instance. We also set synchronous_commit=off in postgresql.conf. This is a common performance configuration for write-heavy workloads while still maintaining transactional, logged integrity.
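For reference, the same setting can be applied from SQL rather than by editing postgresql.conf directly (the effect is the same):

ALTER SYSTEM SET synchronous_commit = 'off';
SELECT pg_reload_conf();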

Insert performance

For insert performance, we used the following datasets and configurations. The datasets were created using Time-Series Benchmarking Suite with the cpu-only use case.

  • Dataset: 100-1,000,000 simulated devices generated 10 CPU metrics every 10 seconds for ~100 million reading intervals.
  • Intervals used for each configuration are as follows: 31 days for 100 devices; 3 days for 4,000 devices; 3 hours for 100,000 devices; 30 minutes for 1,000,000 devices
  • Batch size: Inserts were made using a batch size of 5,000 rows for both ClickHouse and TimescaleDB. We tried multiple batch sizes and found that in most cases there was little difference in overall insert efficiency between 5,000 and 15,000 rows per batch with each database.
  • TimescaleDB chunk size: We set the chunk time depending on the data volume, aiming for 7-16 chunks in total for each configuration (more on chunks here).

In the end, these were the performance numbers for ingesting pre-generated time-series data from the TSBS client machine into each database using a batch size of 5,000 rows.

Table showing the final insert results between ClickHouse and TimescaleDB when using larger 5,000 rows/batch
Insert performance comparison between ClickHouse and TimescaleDB with 5,000 row/batches

To be honest, this didn't surprise us. We've seen numerous recent blog posts about ClickHouse ingest performance, and since ClickHouse uses a different storage architecture and mechanism that doesn't include transaction support or ACID compliance, we generally expected it to be faster.

The story does change a bit, however, when you consider that ClickHouse is designed to save every "transaction" of ingested rows as separate files (to be merged later using the MergeTree architecture). It turns out that when you ingest much smaller batches of data, ClickHouse is significantly slower and consumes much more disk space than TimescaleDB.

(Ingesting 100 million rows, 4,000 hosts, 3 days of data - 22GB of raw data)

Table showing the impact of using smaller batch sizes has on TimescaleDB and ClickHouse. TimescaleDB insert performance and disk usage stays steady, while ClickHouse performance is negatively impacted
Insert performance comparison between ClickHouse and TimescaleDB using smaller batch sizes, which significantly impacts ClickHouse's performance and disk usage

Do you notice something in the numbers above?

Regardless of batch size, TimescaleDB consistently consumed ~19GB of disk space with each data ingest benchmark before compression. This is a result of the chunk_time_interval which determines how many chunks will get created for a given range of time-series data. Although ingest speeds may decrease with smaller batches, the same chunks are created for the same data, resulting in consistent disk usage patterns. Before compression, it's easy to see that TimescaleDB continually consumes the same amount of disk space regardless of the batch size.

By comparison, because ClickHouse's storage needs are correlated with how many files need to be written (which is partly dictated by the size of the row batches being saved), it can actually take significantly more storage to save data to ClickHouse before it can be merged into larger files. Even at 500-row batches, ClickHouse consumed 1.75x more disk space than TimescaleDB for a source data file that was 22GB in size.

Read latency

For benchmarking read latency, we used the following setup for each database (the machine configuration is the same as the one used in the Insert comparison):

  • Dataset: 4,000/10,000 simulated devices generated 10 CPU metrics every 10 seconds for 3 full days (100M+ reading intervals, 1B+ metrics)
  • We also enabled native compression on TimescaleDB. We compressed everything but the most recent chunk of data, leaving it uncompressed. This configuration is a commonly recommended one where raw, uncompressed data is kept for recent time periods and older data is compressed, enabling greater query efficiency (see our compression docs for more). The parameters we used to enable compression are as follows: we segmented by the tags_id column and ordered by the time (descending) and usage_user columns (see the sketch after this list).
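For reference, a sketch of those compression settings (the exact invocation on our benchmark tables may differ slightly):

ALTER TABLE cpu SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'tags_id',
    timescaledb.compress_orderby   = 'time DESC, usage_user'
);

-- Compress every chunk except the most recent one(s)
SELECT compress_chunk(c) FROM show_chunks('cpu', older_than => INTERVAL '1 day') AS c;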

On read (i.e., query) latency, the results are more complex. Unlike inserts, which primarily vary on cardinality size (and perhaps batch size), the universe of possible queries is essentially infinite, especially with a language as powerful as SQL. Often, the best way to benchmark read latency is to do it with the actual queries you plan to execute. For this case, we use a broad set of queries to mimic the most common query patterns.

The results shown below are the median from 1000 queries for each query type. Latencies in this chart are all shown as milliseconds, with an additional column showing the relative performance of TimescaleDB compared to ClickHouse (highlighted in green when TimescaleDB is faster, in blue when ClickHouse is faster).

Table showing query response results when querying 4,000 hosts and 100 million rows of data. TimescaleDB outperforms in almost all query categories.
Results of benchmarking query performance of 4,000 hosts with 100 million rows of data
Table showing query response results when querying 10,000 hosts and 100 million rows of data. TimescaleDB outperforms in almost all query categories.
Results of benchmarking query performance of 10,000 hosts with 100 million rows of data

SIMPLE ROLLUPS

For simple rollups (i.e., single-groupby), when aggregating one metric across a single host for 1 or 12 hours, or multiple metrics across one or multiple hosts (either for 1 hour or 12 hours), TimescaleDB generally outperforms ClickHouse at both low and high cardinality. In particular, TimescaleDB exhibited up to 1058% the performance of ClickHouse on configurations with 4,000 and 10,000 devices with 10 unique metrics being generated every read interval.

AGGREGATES

When calculating a simple aggregate for a single device, TimescaleDB consistently outperforms ClickHouse regardless of the total number of devices in the dataset. In our benchmark, TimescaleDB demonstrates 156% the performance of ClickHouse when aggregating 8 metrics across 4,000 devices, and 164% when aggregating 8 metrics across 10,000 devices. Once again, TimescaleDB outperforms ClickHouse for high-end scenarios.

DOUBLE ROLLUPS

The one set of queries where ClickHouse consistently bested TimescaleDB in query latency was the double rollup queries that aggregate metrics by time and another dimension (e.g., GROUP BY time, deviceId). We'll go into a bit more detail below on why this might be, but this also wasn't completely unexpected.

THRESHOLDS

When selecting rows based on a threshold, TimescaleDB demonstrates between 249-357% the performance of ClickHouse when computing thresholds for a single device, but only 130-58% the performance of ClickHouse when computing thresholds for all devices for a random time window.

COMPLEX QUERIES

For complex queries that go beyond rollups or thresholds, the comparison is a bit more nuanced, particularly when looking at TimescaleDB. The difference is that TimescaleDB gives you control over which chunks are compressed. In most time-series applications, especially things like IoT, there's a constant need to find the most recent value of an item or a list of the top X things by some aggregation. This is what the lastpoint and groupby-orderby-limit queries benchmark.

As we've shown previously with other databases (InfluxDB and MongoDB), and as ClickHouse documents themselves, getting individual ordered values for items is not a use case for a MergeTree-like/OLAP database, generally because there is no ordered index that you can define for a time, key, and value. This means asking for the most recent value of an item still causes a more intense scan of data in OLAP databases.

We see that expressed in our results. TimescaleDB was around 3486% faster than ClickHouse when searching for the most recent values (lastpoint) for each item in the database. This is because the most recent uncompressed chunk will often hold the majority of those values as data is ingested, which is a great example of why this flexibility with compression can have a significant impact on the performance of your application.

We fully admit, however, that compression doesn't always return favorable results for every query form. In the last complex query, groupby-orderby-limit, ClickHouse bests TimescaleDB by a significant amount, almost 15x faster. What our results didn't show is that queries that read from an uncompressed chunk (the most recent chunk) are 17x faster than ClickHouse, averaging 64ms per query. The query looks like this in TimescaleDB:

SELECT time_bucket('60 seconds', time) AS minute, max(usage_user)
        FROM cpu
        WHERE time < '2021-01-03 15:17:45.311177 +0000'
        GROUP BY minute
        ORDER BY minute DESC
        LIMIT 5

As you might guess, when the chunk is uncompressed, PostgreSQL indexes can be used to quickly order the data by time. When the chunk is compressed, the data matching the predicate (`WHERE time < '2021-01-03 15:17:45.311177 +0000'` in the example above) must first be decompressed before it is ordered and searched.

When the data for a `lastpoint` query falls within an uncompressed chunk (which is often the case with near-term queries that have a predicate like `WHERE time > now() - INTERVAL '6 hours'`), the results are startling.

(uncompressed chunk query, 4k hosts)

Table showing the positive impact querying uncompressed data in TimescaleDB can have, specifically the lastpoint and groupby-orderby-limit queries.
Query latency performance when lastpoint and groupby-orderby-limit queries use an uncompressed chunk in TimescaleDB

One of the key takeaways from this last set of queries is that the features provided by a database can have a material impact on the performance of your application. Sometimes it just works, while other times having the ability to fine-tune how data is stored can be a game-changer.

Read latency performance summary

  • For simple queries, TimescaleDB outperforms ClickHouse, regardless of whether native compression is used.
  • For typical aggregates, even across many values and items, TimescaleDB outperforms ClickHouse.
  • Doing more complex double rollups, ClickHouse outperforms TimescaleDB every time. To some extent we were surprised by the gap and will continue to investigate how we can better accommodate queries like this on raw time-series data. One solution to this disparity in a real application would be to use a continuous aggregate to pre-aggregate the data.
  • When selecting rows based on a threshold, TimescaleDB outperforms ClickHouse and is up to 250% faster.
  • For some complex queries, particularly a standard query like "lastpoint", TimescaleDB vastly outperforms ClickHouse
  • Finally, depending on the time range being queried, TimescaleDB can be significantly faster (up to 1760%) than ClickHouse for grouped and ordered queries. When these kinds of queries reach further back into compressed chunks, ClickHouse outperforms TimescaleDB because more data must be decompressed to find the appropriate max() values to order by.

Conclusion

You made it to the end! Thank you for taking the time to read our detailed report.

Understanding ClickHouse, and then comparing it with PostgreSQL and TimescaleDB, made us appreciate that there is a lot of choice in today’s database market - but often there is still only one right tool for the job.

Before making a decision on which to use for your application, we recommend taking a step back and analyzing your stack, your team's skills, and what your needs are, now and in the future. Choosing the best technology for your situation now can make all the difference down the road: you want to pick an architecture that evolves and grows with you, not one that forces you to start all over when the data starts flowing from production applications.

We’re always interested in feedback, and we’ll continue to share our insights with the greater community.

Want to learn more about TimescaleDB?

Create a free account to get started with a fully-managed TimescaleDB instance (100% free for 30 days).

Want to host TimescaleDB yourself? Visit our GitHub to learn more about options, get installation instructions, and more (and, as always, ⭐️  are  appreciated!)

Join our Slack community to ask questions, get advice, and connect with other developers (the authors of this post, as well as our co-founders, engineers, and passionate community members are active on all channels).


REGINA OBE: FOSS4G 2021 Buenos Aires Videos are out

Burak Velioğlu: How to scale Postgres for time series data with Citus


Managing time series data at scale can be a challenge. PostgreSQL offers many powerful data processing features such as indexes, COPY and SQL—but the high data volumes and ever-growing nature of time series data can cause your database to slow down over time.

Fortunately, Postgres has a built-in solution to this problem: Partitioning tables by time range.

Partitioning with the Postgres declarative partitioning feature can help you speed up query and ingest times for your time series workloads. Range partitioning lets you create a table and break it up into smaller partitions, based on ranges (typically time ranges). Query performance improves since each query only has to deal with much smaller chunks. Though, you’ll still be limited by the memory, CPU, and storage resources of your Postgres server.

The good news is you can scale out your partitioned Postgres tables to handle enormous amounts of data by distributing the partitions across a cluster. How? By using the Citus extension to Postgres. In other words, with Citus you can create distributed time-partitioned tables. To save disk space on your nodes, you can also compress your partitions—without giving up indexes on them. Even better: the latest Citus 10.2 open-source release makes it a lot easier to manage your partitions in PostgreSQL.

This post is your “how-to” guide to using Postgres with Citus and pg_cron for time series data—effectively transforming PostgreSQL into a distributed time series database. By using Postgres and Citus together, your application will handle the ever-growing volume of time series data more efficiently—making your life easier.

Time series database capabilities explained in this post

  • Partitioning: how to use Postgres native partitioning to split large time series tables into smaller time partitions
  • Easy Partition Management: how to use new functions in Citus to simplify management of partitions
  • Compression: how to use Citus Columnar to compress older partitions, save on storage, and improve query performance
  • Automation: how to use the pg_cron extension to schedule and automate partition management
  • Sharding: how to shard Postgres partitions on single-node Citus
  • Distributing across nodes: how to distribute sharded partitions across nodes of a Citus database cluster, for high performance and scale

How to use Postgres’ built-in partitioning for time series data

Postgres’ built-in partitioning is super useful for managing time series data.

By partitioning your Postgres table on a time column by range (thereby creating a time-partitioned table), you can have a table with much smaller partition tables and much smaller indexes on those partitions—instead of a single huge table.

  • Smaller tables and smaller indexes usually mean faster query responses.
  • Having partitions for different time spans makes it more efficient to drop/delete/expire old data.

To partition your Postgres tables, you first need to create a partitioned table. Partitioned tables are virtual tables and have no storage of their own. You must create partitions to store a subset of the data as defined by its partition bounds, using the PARTITION BY RANGE syntax. Once you ingest data into the partitioned table, Postgres will store the data in the appropriate partition, based on the partitioning key you defined.

-- create a parent table partitioned by range
CREATE TABLE time_series_events (
    event_time timestamp,
    event int,
    user_id int
) PARTITION BY RANGE (event_time);

-- create partitions for that partitioned table
CREATE TABLE time_series_events_p2021_10_10 PARTITION OF time_series_events
    FOR VALUES FROM ('2021-10-10 00:00:00') TO ('2021-10-11 00:00:00');
CREATE TABLE time_series_events_p2021_10_11 PARTITION OF time_series_events
    FOR VALUES FROM ('2021-10-11 00:00:00') TO ('2021-10-12 00:00:00');
CREATE TABLE time_series_events_p2021_10_12 PARTITION OF time_series_events
    FOR VALUES FROM ('2021-10-12 00:00:00') TO ('2021-10-13 00:00:00');

-- insert rows into a partitioned table
INSERT INTO time_series_events VALUES ('2021-10-10 12:00:00', 1, 2);
INSERT INTO time_series_events VALUES ('2021-10-11 12:00:00', 1, 2);
INSERT INTO time_series_events VALUES ('2021-10-12 12:00:00', 1, 2);

If you don’t need a partition anymore, you can manually drop it like dropping a normal Postgres table.

-- drop partitions
DROP TABLE time_series_events_p2021_10_10;
DROP TABLE time_series_events_p2021_10_11;
DROP TABLE time_series_events_p2021_10_12;

By using Postgres’ built-in partitioning, you can utilize the resources of the node you are running Postgres on more wisely, but you’ll need to spend time managing those partitions yourself, unless you take advantage of the new Citus UDFs that simplify Postgres partition management. Read on…

How to use new Citus functions to simplify partition management

Citus 10.2 adds two new user-defined functions (UDFs) to simplify how you can manage your Postgres time partitions: create_time_partitions and drop_old_time_partitions. Using these 2 new Citus UDFs, you no longer need to create or drop time partitions manually.

Both of the new Citus UDFs can be used with regular Postgres tables as well as with distributed Citus tables.

  • create_time_partitions(table_name regclass, partition_interval interval, end_at timestamp with time zone, start_from timestamp with time zone DEFAULT now()): For the given table and interval, create as many partitions as necessary for the given time range.
  • drop_old_time_partitions(table_name regclass, older_than timestamp with time zone): For the given table, drop all the partitions that are older than the given timestamp
-- create partitions per day from 2021-10-10 to 2021-10-30
SELECT create_time_partitions(
    table_name         := 'time_series_events',
    partition_interval := '1 day',
    end_at             := '2021-10-30',
    start_from         := '2021-10-10'
);

You can use the Citus time_partitions view to get details of time-partitioned tables on your cluster.

-- check the details of partitions from time_partitions view
SELECT partition, from_value, to_value, access_method FROM time_partitions;

           partition            |     from_value      |      to_value       | access_method
--------------------------------+---------------------+---------------------+---------------
 time_series_events_p2021_10_10 | 2021-10-10 00:00:00 | 2021-10-11 00:00:00 | heap
 time_series_events_p2021_10_11 | 2021-10-11 00:00:00 | 2021-10-12 00:00:00 | heap
 time_series_events_p2021_10_12 | 2021-10-12 00:00:00 | 2021-10-13 00:00:00 | heap
 time_series_events_p2021_10_13 | 2021-10-13 00:00:00 | 2021-10-14 00:00:00 | heap
 time_series_events_p2021_10_14 | 2021-10-14 00:00:00 | 2021-10-15 00:00:00 | heap
 time_series_events_p2021_10_15 | 2021-10-15 00:00:00 | 2021-10-16 00:00:00 | heap
 time_series_events_p2021_10_16 | 2021-10-16 00:00:00 | 2021-10-17 00:00:00 | heap
 time_series_events_p2021_10_17 | 2021-10-17 00:00:00 | 2021-10-18 00:00:00 | heap
 time_series_events_p2021_10_18 | 2021-10-18 00:00:00 | 2021-10-19 00:00:00 | heap
 time_series_events_p2021_10_19 | 2021-10-19 00:00:00 | 2021-10-20 00:00:00 | heap
 time_series_events_p2021_10_20 | 2021-10-20 00:00:00 | 2021-10-21 00:00:00 | heap
 time_series_events_p2021_10_21 | 2021-10-21 00:00:00 | 2021-10-22 00:00:00 | heap
 time_series_events_p2021_10_22 | 2021-10-22 00:00:00 | 2021-10-23 00:00:00 | heap
 time_series_events_p2021_10_23 | 2021-10-23 00:00:00 | 2021-10-24 00:00:00 | heap
 time_series_events_p2021_10_24 | 2021-10-24 00:00:00 | 2021-10-25 00:00:00 | heap
 time_series_events_p2021_10_25 | 2021-10-25 00:00:00 | 2021-10-26 00:00:00 | heap
 time_series_events_p2021_10_26 | 2021-10-26 00:00:00 | 2021-10-27 00:00:00 | heap
 time_series_events_p2021_10_27 | 2021-10-27 00:00:00 | 2021-10-28 00:00:00 | heap
 time_series_events_p2021_10_28 | 2021-10-28 00:00:00 | 2021-10-29 00:00:00 | heap
 time_series_events_p2021_10_29 | 2021-10-29 00:00:00 | 2021-10-30 00:00:00 | heap
(20 rows)

In time series workloads, it is common to drop (or delete, or expire) old data once they are not required anymore. Having partitions, it becomes very efficient to drop old data as Postgres does not need to read all the data it drops. To make dropping partitions older than a given threshold easier, Citus 10.2 introduced the UDF drop_old_time_partitions.

-- drop partitions older than 2021-10-15
CALL drop_old_time_partitions(
    table_name := 'time_series_events',
    older_than := '2021-10-15'
);

-- check the details of partitions from time_partitions view
SELECT partition, from_value, to_value, access_method FROM time_partitions;

           partition            |     from_value      |      to_value       | access_method
--------------------------------+---------------------+---------------------+---------------
 time_series_events_p2021_10_15 | 2021-10-15 00:00:00 | 2021-10-16 00:00:00 | heap
 time_series_events_p2021_10_16 | 2021-10-16 00:00:00 | 2021-10-17 00:00:00 | heap
 time_series_events_p2021_10_17 | 2021-10-17 00:00:00 | 2021-10-18 00:00:00 | heap
 time_series_events_p2021_10_18 | 2021-10-18 00:00:00 | 2021-10-19 00:00:00 | heap
 time_series_events_p2021_10_19 | 2021-10-19 00:00:00 | 2021-10-20 00:00:00 | heap
 time_series_events_p2021_10_20 | 2021-10-20 00:00:00 | 2021-10-21 00:00:00 | heap
 time_series_events_p2021_10_21 | 2021-10-21 00:00:00 | 2021-10-22 00:00:00 | heap
 time_series_events_p2021_10_22 | 2021-10-22 00:00:00 | 2021-10-23 00:00:00 | heap
 time_series_events_p2021_10_23 | 2021-10-23 00:00:00 | 2021-10-24 00:00:00 | heap
 time_series_events_p2021_10_24 | 2021-10-24 00:00:00 | 2021-10-25 00:00:00 | heap
 time_series_events_p2021_10_25 | 2021-10-25 00:00:00 | 2021-10-26 00:00:00 | heap
 time_series_events_p2021_10_26 | 2021-10-26 00:00:00 | 2021-10-27 00:00:00 | heap
 time_series_events_p2021_10_27 | 2021-10-27 00:00:00 | 2021-10-28 00:00:00 | heap
 time_series_events_p2021_10_28 | 2021-10-28 00:00:00 | 2021-10-29 00:00:00 | heap
 time_series_events_p2021_10_29 | 2021-10-29 00:00:00 | 2021-10-30 00:00:00 | heap
(15 rows)

Now, let’s have a look at some more advanced features you can use once you’ve partitioned your Postgres table—particularly if you are dealing with challenges of scale.

How to compress your older partitions with Citus columnar (now supports indexes too)

Starting with Citus 10, you can use columnar storage to compress your data in a Postgres table. Combining columnar compression with Postgres time partitioning, you can easily decrease the disk usage for your older partitions. Having partitions stored in a columnar fashion can also increase your analytical query performance as queries can skip over columns they don’t need! If you haven’t checked out columnar compression for Postgres yet, you can start with Jeff’s columnar post for a detailed explanation. Or check out this demo video about how to use Citus columnar compression.
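For orientation, here is a minimal sketch of what a columnar table looks like in plain SQL (the table below is made up for illustration and is not part of the example schema):

-- a table stored with the Citus columnar access method
CREATE TABLE sensor_readings_archive (
    reading_time timestamp NOT NULL,
    sensor_id    int       NOT NULL,
    value        double precision
) USING columnar;

In the rest of this section, though, we won’t create columnar tables directly; we’ll convert existing heap partitions to columnar.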

Using the UDF alter_old_partitions_set_access_method you can compress your partitions older than the given threshold by converting the access method from heap to columnar—or vice versa, you can also uncompress by converting the access method from columnar to heap.

  • alter_old_partitions_set_access_method(parent_table_name regclass, older_than timestamp with time zone, new_access_method name): For the given table, compress or uncompress all the partitions that are older than the given threshold.
-- compress partitions older than 2021-10-20
CALL alter_old_partitions_set_access_method('time_series_events', '2021-10-20', 'columnar');

-- check the details of partitions from time_partitions view
SELECT partition, from_value, to_value, access_method FROM time_partitions;

           partition            |     from_value      |      to_value       | access_method
--------------------------------+---------------------+---------------------+---------------
 time_series_events_p2021_10_15 | 2021-10-15 00:00:00 | 2021-10-16 00:00:00 | columnar
 time_series_events_p2021_10_16 | 2021-10-16 00:00:00 | 2021-10-17 00:00:00 | columnar
 time_series_events_p2021_10_17 | 2021-10-17 00:00:00 | 2021-10-18 00:00:00 | columnar
 time_series_events_p2021_10_18 | 2021-10-18 00:00:00 | 2021-10-19 00:00:00 | columnar
 time_series_events_p2021_10_19 | 2021-10-19 00:00:00 | 2021-10-20 00:00:00 | columnar
 time_series_events_p2021_10_20 | 2021-10-20 00:00:00 | 2021-10-21 00:00:00 | heap
 time_series_events_p2021_10_21 | 2021-10-21 00:00:00 | 2021-10-22 00:00:00 | heap
 time_series_events_p2021_10_22 | 2021-10-22 00:00:00 | 2021-10-23 00:00:00 | heap
 time_series_events_p2021_10_23 | 2021-10-23 00:00:00 | 2021-10-24 00:00:00 | heap
 time_series_events_p2021_10_24 | 2021-10-24 00:00:00 | 2021-10-25 00:00:00 | heap
 time_series_events_p2021_10_25 | 2021-10-25 00:00:00 | 2021-10-26 00:00:00 | heap
 time_series_events_p2021_10_26 | 2021-10-26 00:00:00 | 2021-10-27 00:00:00 | heap
 time_series_events_p2021_10_27 | 2021-10-27 00:00:00 | 2021-10-28 00:00:00 | heap
 time_series_events_p2021_10_28 | 2021-10-28 00:00:00 | 2021-10-29 00:00:00 | heap
 time_series_events_p2021_10_29 | 2021-10-29 00:00:00 | 2021-10-30 00:00:00 | heap
(15 rows)

Since you can’t update or delete data on a compressed partition (at least not yet), you may want to use alter_table_set_access_method—another Citus UDF to compress or uncompress a table—to first uncompress your partition by providing heap as the last parameter. Once your partition is uncompressed and back in row-based storage (aka heap), you can update or delete your data. Then you can compress your partition again by calling alter_table_set_access_method and specifying columnar as the last parameter.
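As a small sketch of that round trip, using one of the partitions compressed above:

-- temporarily switch a compressed partition back to row-based storage
SELECT alter_table_set_access_method('time_series_events_p2021_10_16', 'heap');

-- updates and deletes work again on the heap partition
UPDATE time_series_events_p2021_10_16 SET event = 2 WHERE user_id = 2;

-- compress the partition again
SELECT alter_table_set_access_method('time_series_events_p2021_10_16', 'columnar');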

As of Citus 10.2, you can now have indexes¹ on compressed tables. Yes, you can now add indexes to your partitioned tables, even if some partitions of the table are compressed.

-- create index on a partitioned table with compressed partitions
CREATE INDEX index_on_partitioned_table ON time_series_events (user_id);

With these UDFs that make it easier to manage your time partitions—create_time_partitions, drop_old_time_partitions and alter_old_partitions_set_access_method—you can now fully automate partition management using pg_cron.

How to automate partition management with pg_cron

To automate partition management altogether, you can use pg_cron, an open source extension created and maintained by our team. pg_cron enables you to schedule cron-based jobs on Postgres.

You can use pg_cron to schedule these Citus functions for creating, dropping, and compressing partitions—thereby automating your Postgres partition management. Check out Marco’s pg_cron post for a detailed explanation of its usage and evolution over time.

Below is an example of using pg_cron to fully automate your partition management.

-- schedule cron jobs to

-- create partitions for the next 7 days
SELECT cron.schedule('create-partitions', '@daily', $$
    SELECT create_time_partitions(
        table_name         := 'time_series_events',
        partition_interval := '1 day',
        end_at             := now() + '7 days'
    )
$$);

-- compress partitions older than 5 days
SELECT cron.schedule('compress-partitions', '@daily', $$
    CALL alter_old_partitions_set_access_method('time_series_events', now() - interval '5 days', 'columnar')
$$);

-- expire partitions older than 7 days
SELECT cron.schedule('expire-partitions', '@daily', $$
    CALL drop_old_time_partitions('time_series_events', now() - interval '7 days')
$$);

Note that the UDFs scheduled above will be called once per day, since the second parameter of cron.schedule is given as @daily. For other options you can check the cron syntax.

After scheduling those UDFs, you don’t need to think about managing your partitions anymore! Your pg_cron jobs will leverage Citus and Postgres to automatically:

  • create your partitions for the given time span,
  • compress partitions older than given compression threshold, and
  • drop partitions older than given expiration threshold while you are working on your application.

If you want to dive deeper, the use case guide for time series data in our Citus docs will give you a more detailed explanation.

Using Citus’ UDFs to manage partitions and automating them using pg_cron, you can handle your time-series workload on a single Postgres node, hassle-free. Though, you might start to run into performance problems as your database gets bigger. This is where sharding comes in—as well as distributing your database across a cluster—as you can explore in the next two sections.

How to use Citus to shard partitions on a single node

To handle the high data volumes of time series data that cause the database to slow down over time, you can use sharding and partitioning together, splitting your data in 2 dimensions. Splitting your data in 2 dimensions gives you even smaller data and index sizes. To shard Postgres, you can use Citus. And as of Citus 10, you can now use Citus to shard Postgres on a single node—taking advantage of the query parallelism you get from Citus, and making your application “scale out ready”. Or you can use Citus in the way you might be more familiar with, to shard Postgres across multiple nodes.

Here let’s explore how to use Citus to shard your partitions on a single Citus node. And in the next section you’ll see how to use Citus to distribute your sharded partitions across multiple nodes.

To shard with Citus, the first thing you need to do is decide what your distribution column is (sometimes called a sharding key.) To split your data in 2 dimensions and make use of both Postgres partitioning and Citus sharding, the distribution column cannot be the same as the partitioning column. For time series data, since most people partition by time, you just need to pick a distribution column that makes sense for your application. The choosing distribution column guide in the Citus docs gives some useful guidance to help you out here.

Next, to tell Citus to shard your table, you need to use the Citus create_distributed_table function. Even if you are running Citus on a single node, partitions of your table will be sharded on the distribution column you choose. For the purposes of this example, we’ll use user_id as the distribution column.

-- shard partitioned table to have sharded partitioned table
SELECT create_distributed_table('time_series_events', 'user_id');
Figure 1: On the left you see a single Postgres node with a partitioned table. On the right you can see single-node Citus with partitions that have been sharded, too.

Once you shard your partitioned table on a single Citus node, you can easily distribute the table across multiple nodes by scaling out your Citus cluster.

How to use Citus to distribute your sharded partitions across multiple nodes

As your application needs to scale to manage more and more data, the resources (CPU, memory, disk) of the node can become a bottleneck. This is when you will want to shard and distribute Postgres across multiple nodes.

If you’ve already sharded your partitions on single-node Citus, you can easily distribute them by adding more Citus nodes and then rebalancing your tables across the cluster. Let’s explore how to do this below.

To add a new node to the cluster, you first need to add the DNS name (or IP address of that node) and port to the pg_dist_node catalog table using citus_add_node UDF. (If you are running Citus on Azure as part of the managed service, you would just need to move the worker node count slider in the Azure portal to add nodes to the cluster.)

-- Add new node to Citus cluster
SELECT citus_add_node('node-name', 5432);

Then, you need to rebalance your tables to move existing shards to a newly added node. You can use rebalance_table_shards to rebalance shards evenly among the nodes.

-- rebalance shards evenly among the nodes
SELECT rebalance_table_shards();
Figure 2: Distributing sharded partitions across multiple nodes of a Citus database cluster
Alternatively, if you already have a multi-node Citus cluster and want to shard a partitioned table across the nodes in the cluster, you will need to use create_distributed_table to do so. Partitions will then be sharded across the nodes, on the distribution column that you specified.

You can check the Citus documentation for an even more detailed explanation of cluster management.

Other time partitioning extensions for PostgreSQL

Before we introduced the time-partitioning UDFs, the common approach to time-partitioning in Citus was to use the pg_partman extension. For some advanced scenarios, using pg_partman can still be beneficial. In particular, pg_partman supports creating a “template table”, which you can use to have different indexes on different partitions; you can even create unique indexes that don’t cover the partitioning column. You can check pg_partman’s documentation for all of its other functionality.
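For reference, a rough sketch of the pg_partman approach might look like the following (this assumes pg_partman 4.x installed into its usual partman schema; the exact parameters can differ between versions):

-- install pg_partman into its own schema
CREATE SCHEMA partman;
CREATE EXTENSION pg_partman SCHEMA partman;

-- let pg_partman manage daily native partitions of the events table
SELECT partman.create_parent(
    p_parent_table => 'public.time_series_events',
    p_control      => 'event_time',
    p_type         => 'native',
    p_interval     => 'daily'
);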

Another time-partitioning extension for PostgreSQL is TimescaleDB. Unfortunately, Citus and TimescaleDB are currently not compatible. Importantly, Citus has a very mature distributed query engine that scales from a single node to petabyte scale time series workloads. Citus is also fully compatible with PostgreSQL 14, which has new features for handling time series data such as the date_bin function.
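For example, date_bin makes it easy to bucket timestamps into arbitrary intervals; here is a small sketch against the table used in this post:

-- count events per 15-minute bucket (PostgreSQL 14+)
SELECT date_bin('15 minutes', event_time, TIMESTAMP '2021-10-10') AS bucket,
       count(*)
FROM time_series_events
GROUP BY bucket
ORDER BY bucket;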

Citus as a distributed relational time series database

This post shows you how to manage your time series data in a scalable way. By combining the partitioning capabilities of Postgres, the distributed database features of Citus, and the automation of pg_cron—you get a distributed relational time series database. The new Citus user-defined functions for creating and dropping partitions make things so much easier, too.

Capability                 | Description of time series database features in Postgres with Citus
---------------------------+----------------------------------------------------------------------
Partitioning               | Use Postgres native range partitioning feature to split larger tables into smaller partitions, using time periods as the “range”
Partition Management       | Use new functions in Citus to simplify management of time partitions
Compression                | Use Citus Columnar to compress older partitions, saving on storage as well as improving query performance
Automation                 | Use the Postgres pg_cron extension to schedule partition creation, deletion and compression
Sharding                   | Use Citus to shard Postgres tables and increase query performance, either on single-node Citus or across a cluster
Distributing across nodes  | Use Citus to distribute shards across multiple nodes; to enable parallel, distributed queries; & to use memory, CPU, and storage from multiple nodes

If you want to try using Postgres with the Citus extension, you can download Citus packages or provision Citus in the cloud as a managed database service. You can learn about the latest Citus release in Onder’s 10.2 blog post or check out the time series use case guide in docs.

And if you have questions about scaling your time series workload or Citus in general, feel free to ping us via our public slack channel. To dig in further and try out Citus, our getting started page is a useful place to start.


Footnotes

  1. As of Citus 10.2, hash and btree index types are now supported with Citus columnar.

This article was originally published on citusdata.com.

Suman Michael: Organizing PostgreSQL codebase using templates in Golang


While migrating to PostgreSQL, have you ever wondered whether you could make changes to SQL-only code without triggering a build, re-deployment, or even a restart? Have you ever attempted packaging software as a binary and allowing clients to execute their own compatible queries on any DB engine at no additional cost? In such cases, a better approach to organizing SQL queries in Golang is required. We recently developed an in-house micro-service using Golang and PostgreSQL by leveraging Go's template language feature. In this article, we are going to discuss organizing a PostgreSQL codebase using templates in Golang. For the purpose of this article, some basic utilities were written leveraging Go templates, available in the MigOps repository as the GoTSQL Golang package.

Quick setup of GoTSQL

  1. Installation through Go Get command.

    $ go get github.com/migopsrepos/gotsql
  2. Initialize the GoTSQL object

    import "github.com/migopsrepos/gotsql"
    ...
    g := gotsql.GoTSQL{}
  3. Load the template file or directory with template files. (Templates are discussed in detail in next section)

    g.Load("library/")
    ...
    g.Load("library_dev/")
  4. Get the query using namespace and use it in Golang

    query, err := g.Get("library/books/select/getAllBooks", nil)
    ...
    rows, err := db.Query(query, ...)

That's all; with GoTSQL, separating SQL code from the Go source and maintaining it is now simple. Let's take a closer look into GoTSQL Templates in details.

GoTSQL Templates

Templates are text files that are used to generate dynamic content; Go templates and the text/template engine are used here. Go templates are a great way to shape output in any way you want, whether you're building a web page, sending an email, or, as in this case, maintaining an isolated and modular SQL codebase.

In this blog, we’re going to take a quick look at how to use Go templates and write GoTSQL files, as well as how to integrate them with an application. We'll use a simple SQL codebase as an example, which is just a directory with GoTSQL files (Go template files with SQL) arranged in a hierarchy. The file structure is shown below.

- library
   - books
      - select.gotsql
      - stats.gotsql
   - authors
      - select.gotsql
      - filter.gotsql
   - maintenance.gotsql
   - stats.gotsql

Before we learn how to implement it, let’s take a look at the template syntax. Templates are provided to the appropriate functions either as strings, as "raw strings", or (in this case) as plain text files with the extension .gotsql.

Here's a sample .gotsql file which contains a PostgreSQL template query. You'll learn more as we progress through this blog.

-- FILE: library/authors/select.gotsql
{{ define "getAllAuthors" }}
SELECT * FROM authors 
{{ if .Offset }}OFFSET {{ .Offset }} {{ end }}
{{ if .Limit }}LIMIT {{ .Limit }} {{end}};
{{ end }}

...

{{ define "getAuthor" }}
SELECT * FROM authors WHERE book_id = $1
{{ end }}

...

Actions in templates represent the data evaluations, functions, or control loops. They’re delimited by {{ }}. Other, non-delimited parts are left untouched. By passing a string constant to the define action, you can name the template that is being constructed.

The GoTSQL examples and their corresponding Go snippets below will help you learn how to use it.

Data Evaluations

When you use templates, you usually bind them to a data structure (such as a struct or a map) from which you'll retrieve data.

{{ define "getAllBooksWithOrder" }}

SELECT * FROM books ORDER BY {{ .OrderBy }};

{{ end }}

In the above example, to obtain data from a map, you can use the {{ .OrderBy }} action, which will be replaced with the value of the OrderBy key of the given map at parse time.

query, err := g.Get("library/books/select/getAllBooksWithOrder", map[string]interface{}{
        "OrderBy": "ISBN",
})

You can also use the {{.}} action to refer to the value of a non-struct type.

Conditions

You can also use if-else statements in templates. For example, you can check if .Offset and .Limit are non-empty, and if they are, append their values to the query.

{{ define "getAllAuthors" }}

SELECT * FROM authors 
{{ if .Offset }}OFFSET {{ .Offset }} {{ end }}
{{ if .Limit }}LIMIT {{ .Limit }} {{end}};

{{ end }}
query, err := g.Get("library/authors/select/getAllAuthors", map[string]interface{}{
        "Limit":  "10",
        "Offset": "100",
})

Go templates also support else and else if branches, as well as nested statements, in actions.

Loops, Functions and Variables

You can loop across a slice using the range action. {{range .Member}} ... {{end}} template is used to define a range action.

The range action creates a variable that is set to the iteration's successive elements. A range action can declare two variables that are separated by a comma. Any variable is prefixed by $.

In the example below, let's look at the case of extracting certain columns. You can use $index to get the index of the slice and $col to get the current element of the Columns slice, and implement the logic in the template itself. len and slice are built-in functions; Go also supports custom functions in templates via FuncMap.

Explore more about Pipelines, Variables and Functions in Golang's text/template.

{{ define "getAuthor" }}

{{ $comma_count := len (slice .Columns 1) }}
SELECT
{{ range $index, $col := .Columns}} 
    {{ $col }}{{ if lt $index $comma_count }}, {{ end }} 
{{end}}
FROM authors WHERE author_id = $1

{{ end }}
columns := []string{"firstname", "lastname"}
query, err := g.Get("library/authors/select/getAuthor", map[string]interface{}{
    "Columns": columns,
})
...
rows, err := db.Query(query, authorId)

The general rule of thumb is to treat all user input as untrusted. Hence, it is strongly recommended to use parameterized queries in Golang. database/sql deals with this effectively and is designed to do so. Here is a quick primer on how to avoid SQL injection risk.

This code is meant to generate and maintain parameterized queries. Hence, it is not advised to interpolate user input data through these templates. Notice the WHERE clause, which is parameterized to avoid SQL injection through user input.

To put this into context, we should consider utilizing this tool to produce only parameterized queries to ensure optimum safety.

Why this approach?

SQL is a text-based language that usually belongs in a text file. Storing SQL as text, outside of compiled code, eases maintenance significantly. For applications that rely heavily on databases, separating Go and SQL source code is essential. And your database administrator will appreciate it.

The following are a few of the advantages of this approach:

  • Simplified SQL codebase maintenance that isolates code changes and, most crucially, provides a framework for enabling Version Control (Git Integration), multiple SQL dialects, and optimizations with the same Go code.
  • Composing queries with a template syntax made easier than writing them explicitly in Go code.
  • Having SQL segregated from code makes it much easier for new developers to browse through SQL statements and grasp the database structure. And it allows those who are familiar with SQL to contribute directly to a project without having to learn about all of the other tools involved.
  • It reduces the negative consequences of database modifications. For example, If the name of a table or column changes, the fix is typically as simple as a search and replace in a single text file.
  • It dramatically decreases development cost by allowing users to rapidly import files from an organized template repository and make nominal changes to integrate them into similar projects.
  • If an older application needs to be migrated to a new framework, then the SQL is readily available for porting to the new application.
  • If necessary, Changes to SQL can be done after deployment by updating the text-based template file and restarting or refreshing the server (assuming interfaces are the same).
  • Separation between programmer and DBA responsibilities. There is no confusion about who is responsible for what. The code is done by developers, while the scripts are written by database developers. This makes it more feasible for DBAs to write SQL with higher performance (faster run-time) than an ORM tool would produce. If both do a decent job and the database design is correct, the application will outperform all other ORM frameworks.

Cons, Really?

  • The fact that the queries are stored in another file appears to obfuscate them from the source code, encouraging the programmer to be ignorant of the database. That could be a useful abstraction or a smart division of labor. This problem can be mitigated if the SQL templates are organized and named properly.
  • Some may argue that ORM (Object-Relational-Mapping) should be used instead. ORM frameworks are designed to let you deal with objects and query abstractions. However, if your application is heavily reliant on databases and requires DBA support, ORM won't help you, but it will make life difficult for DBAs. Read this to know why you shouldn't always rely on an ORM.
  • Refactoring could be hindered by a wider separation between Go and database SQL code.
  • Because SQL isn't hidden in the compiled code, it's easier for a third party to read and reverse engineer the application, which may or may not be a concern for a commercial company.
  • When you rename, delete, or add a field, multiple files can become out of sync. Is there a chance you'll encounter a compilation error? Will you be able to locate all of the areas where you need to make changes? Or is that something you have to catch in testing?

Conclusion

In this article we showed how to use Go templates and maintain a PostgreSQL codebase using GoTSQL. The examples provided in this article can be used as a reference to build your own GoTSQL templates. While this approach brings a lot of advantages, the challenges, such as the SQL code being obfuscated from the source code, should remain minimal.

What's next?

GoTSQL's next release will feature native support for registering native golang functions in template actions both globally and at the template level. This will greatly enhance the capability of generating SQL Queries with custom logic. Please watch out for my next article to see how it works.

Migrating to PostgreSQL ?

If you are looking to migrate to PostgreSQL or looking for support in managing PostgreSQL databases, please contact us or fill the following form.

The post Organizing PostgreSQL codebase using templates in Golang appeared first on MigOps.

Hubert 'depesz' Lubaczewski: Why is it hard to automatically suggest what index to create?

Regina Obe: PostGIS 3.2.0beta1 Released


The PostGIS Team is pleased to release the first beta of the upcoming PostGIS 3.2.0 release.

Best served with PostgreSQL 14. This version of PostGIS utilizes the faster GiST building support API introduced in PostgreSQL 14. If compiled with the recently released GEOS 3.10.0, you can take advantage of improvements in ST_MakeValid and numerous speed improvements. This release also includes many additional functions and improvements for the postgis_raster and postgis_topology extensions.

Continue Reading by clicking title hyperlink ..

Andreas 'ads' Scherbaum: Alexander Kukushkin

PostgreSQL Person of the Week Interview with Alexander Kukushkin: My name is Alexander. Originally I am from Russia. Since 2013 I live in Berlin and work as a Database Engineer at Zalando.

Jagadeesh Panuganti: Migration Management and schema version control in PostgreSQL


Version control of any code is essential for every organization. All organizations use their own preferred tools like git, svn, perforce etc. While working on a new requirement about porting the current CI/CD pipeline to PostgreSQL, I needed to look for a tool that would take care of the schema changes. In this use case, the customer maintains all the schema in a separate directory and creates the objects/schema during the application bootstrap process. This is the behavior most applications follow, but having a dedicated schema change management system is something unique. In this article, we shall discuss managing schema version control in PostgreSQL using node-pg-migrate.

There are a good number of open source tools for schema change management requirements. As a NodeJS developer, I was looking for a tool written in NodeJS for schema version control in PostgreSQL. Fortunately, we found a tool called node-pg-migrate which is written in NodeJS. This tool follows its own style of defining the SQL statements: they are managed by a set of predefined javascript functions. By using these functions, we can easily perform the create/alter/drop operations for the SQL objects.

Okay, now let us understand how to perform Schema version control or migration management for PostgreSQL in detail.

Installing node-pg-migrate

We need the latest NodeJS, npm to be installed on the server to install node-pg-migrate. Following commands can be used to perform the installation.

# Installing latest NodeJS
curl -sL https://rpm.nodesource.com/setup_14.x | sudo bash -
sudo yum -y install nodejs

# Installing node-pg-migrate with dependent pg library 
sudo npm install -g node-pg-migrate pg
Creating the First Release

To understand this tool better, let us start with a first release which has only one table in it. You can use the following commands to create the first release.

# Creating a directory for all the SQL release files
$ mkdir releases
$ cd releases/

$ node-pg-migrate create first_release
Created migration -- /home/jagadeesh/releases/migrations/1634906885247_first-release.js

As you see in the above output, this tool actually created a file 1634906885247_first-release.js for this release. We have to edit this file with the list of DDL changes that we are planning to deploy with this release. Now, let us see the contents of this file.

$ cat migrations/1634906885247_first-release.js
/* eslint-disable camelcase */

exports.shorthands = undefined;
exports.up = pgm => {};
exports.down = pgm => {};

In this file, we need to focus on 2 exported objects, up and down, which are used for the following purposes.

up      -> List of SQL statements which have to deploy
down    -> List of SQL statements which will revert this deployment

Now, let us write the first deployment to create only one table, and also write the rollback behavior to drop that table. This means, the up behavior is CREATE and the down behavior is DROP.

Find the following up, down object definitions, which we have updated to the file : /home/jagadeesh/releases/migrations/1634906885247_first-release.js

exports.up = (pgm) => {
  pgm.createTable("test", {
    id: "id",
    name: { type: "varchar(1)", notNull: true },
    createdAt: {
      type: "timestamp",
      notNull: true,
      default: pgm.func("current_timestamp"),
    },
  });
};

exports.down = (pgm) => {
  pgm.dropTable("test");
};

In the above file content, we have not used any standard SQL syntax, rather we used node-pg-migrate's functions to create/drop the objects. Also, the JSON style of declaring the table definition is really cool. In the above example, we defined the PostgreSQL table as JSON object with multiple columns : id, name, createdAt and defined the column's specific properties like data type, default value and not null.

And in the down object, we used node-pg-migrate specific function dropTable, dropping the table test to enable the capability of a rollback.

Performing the First Deployment

To perform the deployment, use the up command of node-pg-migrate as shown below.

$ export DATABASE_URL=postgres:postgres@localhost:5432/postgres
$ node-pg-migrate up
> Migrating files:
> - 1634906885247_first-release
### MIGRATION 1634906885247_first-release (UP) ###
CREATE TABLE "test" (
  "id" serial PRIMARY KEY,
  "name" varchar(1) NOT NULL,
  "createdAt" timestamp DEFAULT current_timestamp NOT NULL
);
INSERT INTO "public"."pgmigrations" (name, run_on) VALUES ('1634906885247_first-release', NOW());

Migrations complete!

From the above results, you can see that the JSON is deserialized into a SQL statement, which created the table in the specified database.

Creating the Second Release

As part of the second release, we wanted to add another column to the test table. Following are the steps we followed for performing the same.

# Creating second release
$ node-pg-migrate create second_release
Created migration -- /home/jagadeesh/releases/migrations/1634916250633_second-release.js

# Adding second release changes
$ cat migrations/1634916250633_second-release.js
/* eslint-disable camelcase */

exports.shorthands = undefined;

exports.up = pgm => {
pgm.addColumns('test', {
    another_column: { type: 'text', notNull: true },
 })
};

exports.down = pgm => {
        pgm.dropColumns('test', ['another_column']);
};
Performing the Second Deployment

Now, let us execute the deployment and apply changes to the test table by using the up command.

$ export DATABASE_URL=postgres:postgres@localhost:5432/postgres
$ node-pg-migrate up
> Migrating files:
> - 1634916250633_second-release
### MIGRATION 1634916250633_second-release (UP) ###
ALTER TABLE "test"
  ADD "another_column" text NOT NULL;
INSERT INTO "public"."pgmigrations" (name, run_on) VALUES ('1634916250633_second-release', NOW());

Migrations complete!
Performing a Rollback of the Second Deployment

Let us assume that we decided to rollback the second release. In our example, the rollback of the second release involves reverting the test table changes. By using the node-pg-migrate down command, we can revert the last release changes as seen in the following block.

$ export DATABASE_URL=postgres:postgres@localhost:5432/postgres
$ node-pg-migrate down
> Migrating files:
> - 1634916250633_second-release
### MIGRATION 1634916250633_second-release (DOWN) ###
ALTER TABLE "test"
  DROP "another_column";
DELETE FROM "public"."pgmigrations" WHERE name='1634916250633_second-release';

Migrations complete!

From the above results, we can see that the column another_column, added as part of the second release, has been dropped from the database.

Performing a Rollback of the First Deployment

Assume that you wish to revert the changes done in the first release too. To perform this action, we simply run node-pg-migrate down again, which will revert the first deployment changes as well.

$ export DATABASE_URL=postgres:postgres@localhost:5432/postgres
$ node-pg-migrate down
> Migrating files:
> - 1634908948307_first-release
### MIGRATION 1634908948307_first-release (DOWN) ###
DROP TABLE "test";
DELETE FROM "public"."pgmigrations" WHERE name='1634908948307_first-release';

Migrations complete!

Whenever we run an up or down operation, node-pg-migrate tracks the release points. The next up or down will continue from the last release point and execute the release scripts accordingly.
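Those release points are stored in the pgmigrations table you can see in the INSERT and DELETE statements above, so you can check which releases are currently applied with a simple query, for example:

-- list the releases node-pg-migrate considers applied
SELECT name, run_on FROM public.pgmigrations ORDER BY run_on;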

Async Releases

As you have seen so far, node-pg-migrate is a NodeJS tool where up and down call a set of pre-defined javascript functions. We can also leverage this by calling Promise-returning functions in the up and down sections of the release file.

This async release behavior gives additional flexibility by allowing us to perform releases of all the PostgreSQL related deployments in parallel with other service related changes. For more details about the usage of async releases, please refer to the official documentation of node-pg-migrate.

Conclusion

As part of the CI/CD requirement, we explored multiple SQL code change management tools. I personally found that node-pg-migrate is developer friendly and a good asset for the CI/CD pipeline, thanks to its extensibility: we can extend its behavior by writing our own javascript functions. This tool provides many pre-defined PostgreSQL related functions enabling smoother deployments. As of now, I have found this tool very reliable and it is doing its job as expected.

Migrating to PostgreSQL ?

If you are looking to migrate to PostgreSQL or looking for support in managing PostgreSQL databases, please contact us or fill the following form.

The post Migration Management and schema version control in PostgreSQL appeared first on MigOps.


Jonathan Katz: Encrypting Postgres Data at Rest in Kubernetes


Encrypting data at rest is often an important compliance task when working on securing your database system. While there are a lot of elements that go into securing a PostgreSQL database, encrypting data at rest helps to protect your data from various offline attacks including the stealing of a disk or tampering. Disk encryption is a popular feature among public database-as-a-service providers, including Crunchy Bridge, to protect data in a multi-tenant environment.

Jonathan Katz: Database Security Best Practices on Kubernetes


As more data workloads shift to running on Kubernetes, one of the important topics to consider is security of your data. Kubernetes brings many conveniences for securing workloads with the ability to extend security functionality databases through the use of the Operator pattern. Database security best practices on Kubernetes is a frequent conversation we're having with our customers around deploying PostgreSQL on Kubernetes with PGO, the open source Postgres Operator from Crunchy Data.

Even with the security conveniences that Kubernetes provides, you should be aware of some database security best practices when managing your data on Kubernetes. Many Kubernetes Operators, including PGO, help deploy your databases to follow these best practices by default, while also providing ways to add your own customizations to further secure your data.

While we at Crunchy Data hold open source PostgreSQL near-and-dear to our hearts, we wanted to provide a list of best practices for securing databases of any nature on Kubernetes. Let's take a look at what you can do to secure your data the cloud-native way!

Laurenz Albe: TCP keepalive for a better PostgreSQL experience


Before there was TCP keepalive
© Laurenz Albe 2021

If you’ve heard about TCP keepalive but aren’t sure what that is, read on. If you’ve ever been surprised by error messages like:

  • server closed the connection unexpectedly
  • SSL SYSCALL error: EOF detected
  • unexpected EOF on client connection
  • could not receive data from client: Connection reset by peer

then this article is for you.

Causes for broken connections

There are several possible causes for broken connections:

Database server crashes

The first two messages in the above list can be the consequence of a PostgreSQL server problem. If the server crashes for whatever reason, you’ll see a message like that. To investigate whether there is a server problem, you should first look into the PostgreSQL log and see if you can find a matching crash report.

We won’t deal with that case in the following, since it isn’t a network problem.

Connections abandoned by the client

If the client exits without properly closing the database connection, the server will get an end-of-file or an error while communicating on the network socket. With the new session statistics introduced in v14, you can track the number of such “abandoned” database connections in pg_stat_database.sessions_abandoned.

For example, if an application server fails and is restarted, it typically won’t close the connections to the database server. This isn’t alarming, and the database server will quickly detect it when the server tries to send data to the client. But if the database session is idle, the server process is waiting for the client to send the next statement (you can see the wait_event “ClientRead” in pg_stat_activity). Then the server won’t immediately notice that the client is no longer there! Such lingering backend processes occupy a process slot and can cause you to exceed max_connections.
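If you want to look for such sessions yourself, the following queries are a small sketch (sessions_abandoned requires v14):

-- connections abandoned by clients, per database (PostgreSQL v14)
SELECT datname, sessions_abandoned FROM pg_stat_database WHERE datname IS NOT NULL;

-- backends currently idle, waiting for the client to send a statement
SELECT pid, usename, state, wait_event
FROM pg_stat_activity
WHERE wait_event = 'ClientRead';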

PostgreSQL v14 has introduced a new parameter idle_session_timeout which closes idle connections after a while. But that will terminate “healthy” idle connections as well, so it isn’t a very good solution. TCP keepalive provides a much better solution to this problem.

Connections closed by a network component

Sometimes both ends of the database connection experience the same problem: each sees that the other end “hung up on them”. In that case, the problem lies somewhere between the database client and the server.

Network connections can get disconnected if there is a real connectivity problem. There’s nothing you can do to change that on the software level. But very often, disconnections are caused by the way firewalls or routers are configured. The network component may have to “memorize” the state of each open connection, and the resources for that are limited. So it can seem expedient to “forget” and drop connections that have been idle for a longer time.

Since a lot of today’s TCP traffic is via HTTP, and HTTP is stateless, that’s not normally a problem. If your HTTP connection is broken, you simply establish a new connection for your next request, which isn’t very expensive. But databases are different:

  • it is expensive to establish a database connection
  • database connections are not stateless; for example, with a closed connection you lose open transactions, temporary tables and prepared statements
  • it is normal for database sessions to be idle for a longer time, for example if you are using a connection pool, or when the client is waiting for the result from a long-running analytical query

This is where TCP keepalive comes in handy as a way to keep idle connections open.

What is TCP keepalive?

Keepalive is a functionality of the TCP protocol. When you set the SO_KEEPALIVE option on a TCP network socket, a timer will start running as soon as the socket becomes idle. When the keepalive idle time has expired without further activity on the socket, the kernel will send a “keepalive packet” to the communication partner. If the partner answers, the connection is considered good, and the timer starts running again.

If there is no answer, the kernel waits for the keepalive interval before sending another keepalive packet. This process is repeated until the number of keepalive packets sent reaches the keepalive count. After that, the connection is considered dead, and attempts to use the network socket will result in an error.

Note that it is the operating system kernel, not the application (database server or client) that sends keepalive messages. The application is not aware of this process.

TCP keepalive serves two purposes:

  • keep network connections from being idle
  • detect if the other communication end has left without closing the network connection
    (The name “keepalive” does not describe that well – “detect dead” would be more to the point).

TCP keepalive default settings

The default values for the keepalive parameters vary from operating system to operating system. On Linux and Windows, the default values are:

  • keepalive idle time: 2 hours on both Linux and Windows
  • keepalive interval: 75 seconds on Linux, 1 second on Windows
  • keepalive count: 9 on Linux, 10 on Windows (this value cannot be changed on Windows)

I could not find the default settings for MacOS.

Using TCP keepalive to keep an idle database session alive

To keep firewalls and routers from closing an idle connection, we need a much lower setting for the keepalive idle time. Then keepalive packets get sent before the connection is closed. This will trick the offending network component into believing that the connection isn’t idle, even if neither database client nor server send any data.

For this use case, keepalive count and keepalive interval are irrelevant. All we need is for the first keepalive packet to be sent early enough.

Using TCP keepalive to detect dead connections

For this use case, reducing the keepalive idle time is often not enough. If the server sends nine keepalive packets with an interval of 75 seconds, it will take more than 10 minutes before a dead connection is detected. So we’ll also reduce the keepalive count, or the keepalive interval, or both – as in this case.

There is still one missing piece to the puzzle: even if the operating system detects that a network connection is broken, the database server won’t notice, unless it tries to use the network socket. If it’s waiting for a request from the client, that will happen immediately. But if the server is busy executing a long-running SQL statement, it won’t notice the dead connection until the query is done and it tries to send the result back to the client! To prevent this from happening, PostgreSQL v14 has introduced the new parameter client_connection_check_interval, which is currently only supported on Linux. Setting this parameter causes the server to “poll” the socket regularly, even if it has nothing to send yet. That way, it can detect a closed connection and interrupt the execution of the SQL statement.
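For example, to make the server check the client socket every ten seconds while a statement is running (the value is just an illustration):

-- PostgreSQL v14, Linux only
ALTER SYSTEM SET client_connection_check_interval = '10s';
SELECT pg_reload_conf();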

Setting TCP keepalive parameters on the PostgreSQL server

The PostgreSQL server always sets SO_KEEPALIVE on TCP sockets to detect broken connections, but the default idle timeout of two hours is very long.

You can set the configuration parameters tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count (the last one is not supported on Windows) to change the settings for all server sockets.

This is the most convenient way to configure TCP keepalive for all database connections, regardless of the client used.
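As an illustration, a possible configuration could look like this (the values are examples, not a recommendation):

-- send the first keepalive probe after 60 seconds of idle time,
-- then probe every 5 seconds and give up after 3 unanswered probes
ALTER SYSTEM SET tcp_keepalives_idle = 60;
ALTER SYSTEM SET tcp_keepalives_interval = 5;
ALTER SYSTEM SET tcp_keepalives_count = 3;
SELECT pg_reload_conf();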

Setting TCP keepalive parameters on the PostgreSQL client

The PostgreSQL client shared library libpq has the connection parameters keepalives_idle, keepalives_interval and keepalives_count (again, the latter is not supported on Windows) to configure keepalive on the client side.

These parameters can be used in PostgreSQL connection strings with all client interfaces that link with libpq, for example, Psycopg or PHP.

The PostgreSQL JDBC driver, which does not use libpq, only has a connection parameter tcpKeepAlive to enable TCP keepalive (it is disabled by default), but no parameter to configure the keepalive idle time and other keepalive settings.

Setting TCP keepalive parameters on the operating system

Instead of configuring keepalive settings specifically for PostgreSQL connections, you can change the operating system default values for all TCP connections – which can be useful, if you are using a PostgreSQL client application that doesn’t allow you to set keepalive connection parameters.

On Linux, this is done by editing the /etc/sysctl.conf file:

# detect dead connections after 60 seconds of idle time plus 3 probes at 5-second intervals
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 5
net.ipv4.tcp_keepalive_probes = 3

To activate the settings without rebooting the machine, run

sysctl -p

On Windows, you change the TCP keepalive settings by adding these registry keys:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveTime
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveInterval

As noted above, there is no setting for the number of keepalive probes, which is hard-coded to 10. The registry keys must be of type DWORD, and the values are in milliseconds rather than in seconds.

After changing these keys, restart Windows to activate them.

Conclusion

Configuring TCP keepalive can improve your PostgreSQL experience, either by keeping idle database connections open, or through the timely detection of broken connections. You can configure keepalive on the PostgreSQL client, on the server, or on the operating system.

In addition to configuring keepalive, set the new parameter client_connection_check_interval to cancel long-running queries when the client has abandoned the session.

 

To learn more about terminating database connections, see our blogpost here.

If you would like to learn about connections settings, see my post about max_connections.

The post TCP keepalive for a better PostgreSQL experience appeared first on CYBERTEC.

Luca Ferrari: pspg lands in OpenBSD


A great pager into a great operating system.

pspg lands in OpenBSD

pspg is a great pager specifically designed for PostgreSQL, or better, for psql, the default and powerful text client for PostgreSQL databases.
But pspg is more than simply a pager for PostgreSQL: it is a general purpose pager for tabular data.

It happened that a few weeks ago I was using an OpenBSD system, and since I had to do some work with PostgreSQL, I decided to install pspg to get some advantages. Unluckily, there was no package for OpenBSD, and most notably, no port in the ports tree.
Therefore, the only chance to install pspg was to compile it from sources, but I failed. I opened an issue to get some help, and after some assistance, I decided to dig deeper. So I asked for help on the misc OpenBSD mailing list and got much more than I was expecting: not only did I solve the problem of how to install pspg, but the application was noticed and a new port was proposed.
In fact, another Italian, Omar, prepared and proposed a pspg port, and after a few days the port got included in the ports tree!

What does that mean? That, at least at the moment of writing, you can get pspg installed on OpenBSD via the ports tree:



% cd /usr/ports/databases/pspg
% doas make install
===> pspg-5.4.0 depends on: postgresql-client-* -> postgresql-client-13.4p0
===> pspg-5.4.0 depends on: readline-* -> readline-7.0p0
===> pspg-5.4.0 depends on: metaauto-* -> metaauto-1.0p4
===> pspg-5.4.0 depends on: autoconf-2.69 -> autoconf-2.69p3
===> pspg-5.4.0 depends on: gmake-* -> gmake-4.3
===>  Verifying specs: c curses ereadline m panel pq
===>  found c.96.1 curses.14.0 ereadline.2.0 m.10.1 panel.6.0 pq.6.12
===>  Installing pspg-5.4.0 from /usr/ports/packages/amd64/all/
pspg-5.4.0: ok



It is important to note that the ports tree that includes pspg, at the time of writing, is -CURRENT (see here), and therefore there is still some time to wait before pspg is available as a package and a port in the -RELEASE ports tree.

Great OpenBSD Job!

I must say that I was astonished by the great work done by Omar and the other OpenBSD volunteers to get pspg into the ports tree.

Conclusions

pspg is a very useful and interesting pager for tabular like data, and of course this includes output from PostgreSQL’s psql command line client.
With a bit of luck, patience, and the effort of the OpenBSD community, this program will soon be available on OpenBSD as a package too!

Franck Pachot: COPY progression in YugabyteDB


When you want to evaluate the migration from PostgreSQL to YugabyteDB you may:

  • load metadata (DDL) from a pg_dump
  • load data with pg_restore
  • test the application

Here are some quick scripts that I've used for that. They can also give an idea of what can be done to import data from an http endpoint in JSON into a PostgreSQL table.

DDL

The DDL generated by pg_dump creates the table and then adds the primary key. This is fine in PostgreSQL, as tables are heap tables and the primary key index is a secondary index. However, YugabyteDB tables are stored in LSM trees, organized (sharded by hash or range, and sorted) according to the primary key. This is similar to tables in MySQL, clustered indexes in SQL Server, or Index Organized Tables in Oracle, but with horizontal partitioning to scale out. So, if you run the CREATE TABLE first and the ALTER TABLE to add the primary key afterwards, you create the table organized on a generated UUID and then reorganize it. On empty tables this is not a big deal, but it is better to do it all in the CREATE statement. One way to do that is using yb_dump, the Yugabyte fork of pg_dump, to export from PostgreSQL. But if you already have the .sql file, here is a quick awk to modify it:

awk '
# gather primary key definitions in pk[] and cancel lines that adds it later
/^ALTER TABLE ONLY/{last_alter_table=$NF}
/^ *ADD CONSTRAINT .* PRIMARY KEY /{sub(/ADD /,"");sub(/;$/,"");pk[last_alter_table]=$0",";$0=$0"\\r"}
# second pass (i.e when NR>FNR): add primary key definition to create table
NR > FNR && /^CREATE TABLE/{ print $0,pk[$3] > "schema_with_pk.sql" ; next}
# disable backfill for faster create index on empty tables
/^CREATE INDEX/ || /^CREATE UNIQUE INDEX/ { sub("INDEX","INDEX NONCONCURRENTLY") }
NR > FNR { print > "schema_with_pk.sql" }
' schema.sql schema.sql

This is a quick-and-dirty two-pass run over the DDL: the first pass gathers the primary key definitions, and the second pass puts them into the CREATE TABLE statements.
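To illustrate the goal of the rewrite (table, column, and constraint names are made up; the script injects the constraint right after the opening parenthesis, which is equivalent):

-- as dumped by pg_dump (two steps):
CREATE TABLE accounts (
    id bigint NOT NULL,
    name text
);
ALTER TABLE ONLY accounts
    ADD CONSTRAINT accounts_pkey PRIMARY KEY (id);

-- the intended result (one step; the table is organized on its primary key from the start):
CREATE TABLE accounts (
    id bigint NOT NULL,
    name text,
    CONSTRAINT accounts_pkey PRIMARY KEY (id)
);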

Note that I also change the CREATE INDEX to disable backfill. More info about this in a previous blog post: https://dev.to/yugabyte/create-index-in-yugabytedb-online-or-fast-2dl3
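For example (index and table names are made up), the substitution turns:

-- as dumped by pg_dump:
CREATE INDEX orders_customer_idx ON orders (customer_id);

-- into the non-backfilled variant, which is faster on empty tables:
CREATE INDEX NONCONCURRENTLY orders_customer_idx ON orders (customer_id);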

You will probably create the secondary indexes after the load, but I like to run the whole DDL once, before loading data, to see if there are some PostgreSQL features that we do not support yet.

COPY progress

Then you may import terabytes of data. This takes time (there is work in progress to optimize this) and you probably want to see the progress. As we commit every 1000 rows by default (this can be changed with rows_per_transaction), you can SELECT count(*), but that takes time on huge tables and requires increasing the default timeout.
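If an exact count is still wanted, a minimal sketch (the table name is a placeholder) is to raise the timeout for the session first:

SET statement_timeout = '10min';      -- only for this session
SELECT count(*) FROM my_big_table;    -- my_big_table is a placeholder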

We have many statistics from the tablet server, exposed through the tserver endpoint, like http://yb1.pachot.net:9000/metrics, as JSON. But for a PoC you may not have set up any statistics collection and just want a quick look at the numbers.

Here is a script to gather all tablet metrics into a temporary table:

-- create a temporary table to store the metrics:

drop table if exists my_tablet_metrics;
create temporary table if not exists my_tablet_metrics (
 table_id text,tablet_id text
 ,schema_name text, table_name text
 ,metric text, value numeric
 ,tserver text
 ,timestamp timestamp default now()
);

-- gather the metrics from the tserver:

do $do$ 
 declare
  tserver record;
 begin
  -- gather the list of nodes and loop on them
  for tserver in (select * from yb_servers())
   loop
    -- use COPY from wget/jq to load the metrics
    execute format($copy$
     copy my_tablet_metrics(table_id,tablet_id
     ,schema_name,table_name
     ,metric,value,tserver)
     from program $program$
      wget -O- http://%s:9000/metrics | 
      jq --arg node "%s" -r '.[] 
       |select(.type=="tablet") 
       |.attributes.table_id+"\t"+.id
       +"\t"+.attributes.namespace_name
       +"\t"+.attributes.table_name
       +"\t"+(.metrics[]|select(.value>0)
       |(.name+"\t"+(.value|tostring)))
       +"\t"+$node
       '
     $program$
    with (rows_per_transaction 0)
   $copy$,tserver.host,tserver.host);
   end loop;
 end; 
$do$;

-- aggregate the per-tablet row insertion metrics:

select sum(value),format('%I.%I',schema_name,table_name) 
from my_tablet_metrics where metric='rows_inserted'
group by 2 order by 1
;

I use a server-side COPY FROM PROGRAM here, so you must have installed wget to read from the http endpoint and jq to transform the JSON structure into tab-separated text.

Here is an example after loading the Northwind demo schema:

(screenshot of the query output in the original post)

If you find any issue when loading your PostgreSQL database into YugabyteDB, please reach out to us (https://www.yugabyte.com/community); there are many things on the roadmap.

Doug Hunley: Secure PostgreSQL 14 with CIS Benchmark


Crunchy Data is proud to announce an update to the CIS PostgreSQL Benchmark by the Center for Internet Security (CIS). CIS is a nonprofit organization that publishes best practices and standards for securing modern technology and systems. This newly published CIS PostgreSQL 14 Benchmark adds to the existing CIS Benchmarks for PostgreSQL 9.5 - 13 and builds upon Crunchy Data's ongoing efforts with the PostgreSQL Security Technical Implementation Guide (PostgreSQL STIG).

Luca Ferrari: Perl Weekly Challenge 136: PostgreSQL Solutions


My personal solutions to the Perl Weekly Challenge, this time in PostgreSQL!

Perl Weekly Challenge 136: PostgreSQL Solutions

Wait a minute, what the hell is going on? A Perl challenge and PostgreSQL?
Well, it has been almost two years now since I started participating regularly in the Perl Weekly Challenge, and I always solve the tasks in Raku (aka Perl 6).
Today I decided to spend a few minutes trying to solve the assigned tasks in PostgreSQL. And I tried to solve them in an SQL way: declaratively.

So here are my solutions in PostgreSQL for Challenge 136.


PWC 136 - Task 1

The first task asked to find out whether two numbers are friends, meaning that their greatest common divisor is a positive power of 2. This is quite easy to implement in pure SQL:



CREATE OR REPLACE FUNCTION friendly( m int, n int )
RETURNS int
AS $CODE$
  SELECT CASE gcd( m, n ) % 2
         WHEN 0 THEN 1
         ELSE 0
         END;
$CODE$
LANGUAGE SQL;



The gcd function finds the greatest common divisor, then I apply the modulo % operator and check the remainder: if it is 0 then the gcd is a power of 2, else it is not.
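As a quick check of the function above (the comments describe what the function, as written, returns):

SELECT friendly( 4, 8 );   -- gcd is 4, which is even, so the function returns 1
SELECT friendly( 6, 9 );   -- gcd is 3, which is odd, so the function returns 0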

PWC 136 - Task 2

The second task was much more complicated to solve and required, at least for me, a little try-and-modify approach. Given a specific value, we need to find all unique combinations of numbers within the Fibonacci sequence that sum to that value.
I decided to solve it via a recursive Common Table Expression (CTE), since I need to produce a Fibonacci series:



CREATE OR REPLACE FUNCTION fibonacci_sum( l int DEFAULT 16 )
RETURNS bigint
AS $CODE$

WITH RECURSIVE
fibonacci( n, p ) AS
(
        SELECT 1, 1
        UNION
        SELECT p + n, n
        FROM fibonacci
        WHERE n < l
)
, permutations AS
(
        SELECT n::text AS current_value, n as total_sum
        FROM fibonacci
        UNION
        SELECT current_value || ',' || n, total_sum + n
        FROM permutations, fibonacci
        WHERE
                position( n::text in  current_value ) = 0
       AND n > ALL( string_to_array( current_value, ',' )::int[] )


)
SELECT count(*)
FROM permutations
WHERE total_sum = l
;

$CODE$
LANGUAGE SQL;



The searched value is the argument to the function, that is l.
The first part of the CTE computes the Fibonacci sequence only up to l, so all larger values can be thrown away, since any sum including them would exceed l.
The permutations CTE computes a two-column materialization: each value from the Fibonacci sequence is appended to the list so far, and the running sum is computed. Note the WHERE clause:

  • the position function checks that the number has not already been inserted into the list;
  • the n > ALL condition keeps only ordered lists: 3,5 is a good list, but 5,3 is not, because the appended n must be greater than everything already in the list.

Thanks to the trick of considering only ordered sequences, I can trim out all the sequences that produce the same sum with the same numbers in a different order. For example, 3,13 and 13,3 produce the same value, but only the first one is kept.
At this point, it suffices to count how many tuples there are in permutations to get the final answer to the task: how many combinations in the Fibonacci series sum to l.
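As a quick check, the function can then be invoked directly:

SELECT fibonacci_sum( 16 );   -- counts the ordered Fibonacci combinations summing to 16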

Conclusions

Clearly PostgreSQL provides all the features needed to implement program-like behavior in a declarative way. Of course, the above solutions are neither the best nor the most efficient that could be implemented, but they demonstrate how powerful PostgreSQL (and, more generally, SQL) can be for solving tasks where a few nested loops seem the simpler approach!


Nikolay Samokhvalov: How partial, covering, and multicolumn indexes may slow down UPDATEs in PostgreSQL

How partial, covering, and multicolumn indexes may slow down UPDATEs in PostgreSQL

Based on a true story.

This article was originally published in 2018. This is a reviewed and extended version of it. The discussed findings apply to any currently supported major version of PostgreSQL.

Primum non nocere

"Primum non nocere" – this is a fundamental principle that is well-known to anyone working in healthcare: "first, do no harm". It is a reminder: when considering any action that is supposed to improve something, we always need to look at the global picture to see if there might be something else that be damaged by the same action.

This is a great principle and it is used not only in healthcare, of course. I strongly believe that it has to be used in database optimization too, and we need better tools to make it happen.

Pavel Stehule: Orafce 3.17.0

I released Orafce 3.17.0. This is a bugfix-only release. Gilles Darold wrote a patch that fixes Orafce's regexp functions for NULL arguments.

Andreas 'ads' Scherbaum: Tatsuro Yamada

PostgreSQL Person of the Week Interview with Tatsuro Yamada: I was born in Hokkaido, a prefecture in the north part of Japan, and I live near Tokyo now. I work as an in-house database engineer at NTT Comware, providing technical support, product verification, and human resource education for PostgreSQL. And I’m contributing to the PostgreSQL community by developing useful features for DBAs, and by reporting problems.

cary huang: The PostgreSQL Timeline Concept


1. Introduction

In my previous blog post here, I discussed PostgreSQL’s point-in-time recovery, where PostgreSQL supports the ability to recover your database to a specific time, recovery point, or transaction ID in the past, but I did not discuss in detail the concept of the timeline ID, which is also important in database recovery.

2. What is timeline ID and Why it is important?

A timeline ID is basically a point of divergence in the WAL. It represents a point, or, to be exact, the LSN of the WAL at which the database starts to diverge. Divergence happens when a user performs a point-in-time recovery or when a standby server is promoted. The timeline ID is encoded in the first 8 hexadecimal digits of the WAL segment file names under the pg_wal/ directory.

For example:
pg_wal/000000010000000000000001, indicates that this WAL segment belongs to timeline ID = 1

and

pg_wal/000000020000000000000001, indicates that this WAL segment belongs to timeline ID = 2
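As a small sketch, the current timeline can also be read back on a live primary by taking the first 8 hexadecimal digits of the current WAL file name:

SELECT left( pg_walfile_name( pg_current_wal_lsn() ), 8 ) AS current_timeline_hex;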

Timeline ID behaves somewhat like git branch function without the ability to move forward in parallel and to merge back to the master branch. Your development starts from a master branch, and you are able to create a new branch (A) from the master branch to continue a specific feature development. Let’s say the feature also involves several implementation approaches and you are able to create additional branches (B, C and D) to implement each approach.

This is a simple illustration of git branch:

With timeline IDs, your database starts at timeline ID 1 and stays there for all subsequent database operations. Timeline ID 2 is created when the user performs a point-in-time recovery on timeline 1, and all of the subsequent database operations from that point belong to timeline ID 2. While at 2, the user could perform more PITR to create timelines 3, 4 and 5 respectively.

In the previous PITR blog, I mentioned that you could do PITR based on a time, a recovery point, an LSN, or a transaction ID, but all of these can only apply to one particular timeline. In postgresql.conf, you can select the desired recovery timeline via the recovery_target_timeline parameter. This parameter can be 'latest', 'current', or a particular timeline ID value.

With this configuration, a user is able to recover the database to a particular point of a particular timeline in the past.
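A minimal sketch of such a configuration in postgresql.conf (the values are illustrative):

recovery_target_timeline = 'latest'    # or 'current', or an explicit timeline ID such as '3'
recovery_target_lsn = '0/3002B08'      # recover up to this LSN on the chosen timeline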

This is a simple illustration of timeline ID:

3. The History File Associated with a Timeline ID

The history files are created under the pg_wal/ directory with a .history suffix when a new timeline ID is created. This file describes all the past divergence points that have to be replayed in order to reach the current timeline. Without this file, it is impossible to tell where a timeline comes from, and thus PITR cannot be done.

For example, a history file 00000003.history may contain the following contents

cat pg_wal/00000003.history
1       0/30000D8       no recovery target specified

2       0/3002B08       no recovery target specified

which means that timeline 3 comes from LSN(0/3002B08) of timeline 2, which comes from LSN(0/30000D8) of timeline 1.

4. Importance of Continuous Archiving

With the concept of timeline ID, it is possible that the same LSN or the same WAL segments exist in multiple timelines.

For example, the WAL segments 3, 4, 5, and 6 exist in both timeline 1 and timeline 2, but with different contents. Since the current timeline is 2, the ones in timeline 2 will continue to grow forward.

000000010000000000000001
000000010000000000000002
000000010000000000000003
000000010000000000000004
000000010000000000000005
000000010000000000000006
000000020000000000000003
000000020000000000000004
000000020000000000000005
000000020000000000000006
000000020000000000000007
000000020000000000000008

With more timelines created, the number of WAL segment files may also increase. Since PostgreSQL keeps only a certain number of WAL segment files before deleting them, it is super important to archive all the WAL segments to a separate location, either by enabling the continuous archiving function or by using the pg_receivewal tool. With all WAL segment files archived in a separate location, the user is able to perform a successful point-in-time recovery to any timeline and any LSN.
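A minimal continuous-archiving sketch in postgresql.conf (the archive directory is illustrative):

archive_mode = on
archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'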

Cary is a Senior Software Developer in HighGo Software Canada with 8 years of industrial experience developing innovative software solutions in C/C++ in the field of smart grid & metering prior to joining HighGo. He holds a bachelor's degree in Electrical Engineering from the University of British Columbia (UBC) in Vancouver (2012) and has extensive hands-on experience in technologies such as: Advanced Networking, Network & Data security, Smart Metering Innovations, deployment management with Docker, Software Engineering Lifecycle, scalability, authentication, cryptography, PostgreSQL & non-relational databases, web services, firewalls, embedded systems, RTOS, ARM, PKI, Cisco equipment, functional and Architecture Design.

The post The PostgreSQL Timeline Concept appeared first on Highgo Software Inc..

Lukas Fittl: How we deconstructed the Postgres planner to find indexing opportunities

Everyone who has used Postgres has directly or indirectly used the Postgres planner. The Postgres planner is central to determining how a query gets executed, whether indexes get used, how tables are joined, and more. When Postgres asks itself "How do we run this query?", the planner answers. And just like Postgres has evolved over decades, the planner has not stood still either. It can sometimes be challenging to understand what exactly the Postgres planner does, and which data it bases its…

