Channel: Planet PostgreSQL

Eric Hanson: Aquameta Chapter 3: event - The Atoms of Change


Aquameta is a web development platform built entirely in PostgreSQL. This is chapter three (introduction, meta, file system) of our rollout of Aquameta's architecture.

This chapter is about Aquameta's event module, which lets developers select particular data changes they would like to watch, subscribe to a feed of matching events, and be notified when they happen. It uses PostgreSQL's LISTEN/NOTIFY system to send notifications to any listening clients.

Goals

If you've been following the previous chapters, you already know that Aquameta is designed around our first principle of datafication, rethinking each layer in the stack as relational data. Here in the event module we start to see the fruits of our labor and the benefits of systemic consistency.

In a "traditional" web stack in 2016, there are different kinds of event systems throughout the stack's various layers. Git has git hooks for commit events, the file system has inotify for file change events, application-level events can use something like Django signals, we might have a message queue like celery for general purpose message passing. Among others.

Our goal with event is to do all of the above with one system. Because of Aquameta's first principle of datafication, the idea is that any change that can possibly happen in Aquameta is some kind of data change.

We'll use event in the future to keep the DOM in sync with the database, handle pub/sub communication of data change events, and build more advanced coding patterns in the spirit of "live coding".

Relational diff

To understand what a data change event is, let's start with a simple data set and make some changes to it:

person

id | name           | score
---+----------------+------
 1 | Joe Smith      | 15
 2 | Don Jones      | 12
 3 | Sandy Hill     | 16
 4 | Nancy Makowsky |  9

Now imagine running the following SQL to change the data:

insert into person (name, score) values ('Don Pablo', 14);  
update person set name='Sandy Jones', score=score+3 where id=3;  
delete from person where id=4;  

After the changes:

person table - after change

id | name        | score
---+-------------+------
 1 | Joe Smith   | 15
 2 | Don Jones   | 12
 3 | Sandy Jones | 19
 5 | Don Pablo   | 14

Here's what you might call a "relational diff", highlighting the difference between the two tables:

person table - inclusive diff

id | name           | score | change
---+----------------+-------+-----------------------
 1 | Joe Smith      | 15    |
 2 | Don Jones      | 12    |
 3 | Sandy Jones    | 19    | name and score updated
 4 | Nancy Makowsky |  9    | row deleted
 5 | Don Pablo      | 14    | row inserted

Aquameta's event model for data changes builds on the observation that we can express the "diff" between any two database states as a collection of operations of precisely three types:

change type  | arguments
-------------+-----------------------
row_insert   | relation_id, row data
row_delete   | row_id
field_update | field_id, new value

In this frame, we can express the "delta" between these two tables as a set of these operations:

SQL command / change type / arguments:

insert into person (name, score) values ('Don Pablo', 14);
    row_insert    public.person, { id: 5, name: Don Pablo, score: 14 }

delete from person where id=4;
    row_delete    public.person.4

update person set name='Sandy Jones', score=score+3 where id=3;
    field_update  public.person.3.name, Sandy Jones
    field_update  public.person.3.score, 19

You could imagine a log of changes like the one above running in parallel with the PostgreSQL query log. But rather than logging the commands that have been executed, it logs the resultant changes of those commands. These three simple operations (row_insert, row_delete, field_update) encompass all the ways data can change.
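
To make this concrete, here is a minimal sketch of what such a change log could look like if it were stored as a plain table. This is an illustration only, not Aquameta's actual schema:

create table change_log (
    id          bigserial primary key,
    change_type text not null
                check (change_type in ('row_insert', 'row_delete', 'field_update')),
    row_id      text,   -- e.g. 'public.person.4' for row_delete and field_update
    field_name  text,   -- only used for field_update
    new_value   jsonb,  -- full row data for row_insert, the new value for field_update
    recorded_at timestamptz not null default now()
);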

So that covers data changes, but what about schema changes, what some call "migrations"? Say we were to add an age column to the table above:

alter table public.person add column age integer;  

In Aquameta, schema changes can be represented as data changes as well, via meta, our writable system catalog. The column could have just as well been created via an insert into the meta.column table:

insert into meta.column (schema_name, relation_name, name, type) values ('public','person','age', 'integer');  

So, we can also represent schema changes in our event log:

SQL command / change type / arguments:

alter table public.person add column age integer;
    row_insert    meta.column, { schema_name: public, relation_name: person, name: age, type: integer }

The event module doesn't yet support schema changes, but it's certainly possible via PostgreSQL's DDL events mechanism.
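
As a sketch of what that could look like (this is plain PostgreSQL event trigger machinery, not something the event module ships today):

create or replace function notify_ddl_event() returns event_trigger as $$
begin
    -- tg_tag holds the command tag of the DDL statement, e.g. 'ALTER TABLE'
    perform pg_notify('ddl_events', tg_tag);
end;
$$ language plpgsql;

create event trigger watch_ddl
    on ddl_command_end
    execute procedure notify_ddl_event();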

Example Usage

Ok, let's take a look at the event system in action.

Sessions

To identify where to send events, we use a session, an abstract entity that represents one user session, say a browser tab or a cookie session. In Aquameta, sessions are the primary key for persistent state; they can be used across PostgreSQL connections and by multiple connections and roles at the same time. They serve as a kind of inbox for events, among other things. Users create sessions and can detach and reattach to them, or the web server can create them.

Let's create a new session:

aquameta=# select session_create();  
            session_create
--------------------------------------
 ceb2c0cf-9985-454b-bc79-01706b931a3b
(1 row)

Subscriptions

Once a session has been created, a client can subscribe to data changes at various levels of granularity: an entire table, a single column, a single row, or a single field. Here's the API:

  1. event.subscribe_table( meta.relation_id ) - generates row_insert, row_delete, field_change
  2. event.subscribe_row( meta.row_id ) - generates field_change, row_delete
  3. event.subscribe_field( meta.field_id ) - generates field_change, row_delete
  4. event.subscribe_column( meta.column_id ) - generates field_change, row_delete, row_insert

For example, let's subscribe to the widget.machine table:

aquameta=# select subscribe_table(meta.relation_id('widget','machine'));  
           subscribe_table            
--------------------------------------
 ac944107-7679-4987-919b-9f3f39cfdf70
(1 row)

Then events come through via PostgreSQL NOTIFY messages:

aquameta=# insert into widget.machine values (DEFAULT);  
INSERT 0 1  
Asynchronous notification "92841351-8c73-4548-a801-e89c626b9ec0" with payload "{"operation" : "insert", "subscription_type" : "table", "row_id" : {"pk_column_id":{"relation_id":{"schema_id":{"name":"widget"},"name":"machine"},"name":"id"},"pk_value":"70e63984-1b70-4324-b5f1-6b6efca09169"}, "payload" : {"id":"70e63984-1b70-4324-b5f1-6b6efca09169"}}" received from server process with PID 22834.  
aquameta=#  
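
For the notification above to arrive, the psql session has to be listening on the corresponding channel. The channel name is a UUID (taken here from the output above), so it must be double-quoted when used as an identifier:

listen "92841351-8c73-4548-a801-e89c626b9ec0";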

Conclusion

Together, it's a dead simple data change event system. It is highly general purpose, because it's positioned immediately atop our first principle of datafication. Everything that we'll build in Aquameta further up the stack can have a consistent and uniform event model.


Andreas Scherbaum: March South Bay PostgreSQL Meetup at Adobe

Hubert 'depesz' Lubaczewski: Waiting for 9.6 – Tsvector editing functions

On 11th of March, Teodor Sigaev committed patch: Tsvector editing functions   Adds several tsvector editting function: convert tsvector to/from text array, set weight for given lexemes, delete lexeme(s), unnest, filter lexemes with given weights   Author: Stas Kelvich with some editorization by me Reviewers: Tomas Vondram, Teodor Sigaev For those that don't know tsvector […]

Robert Haas: Parallel Query Is Getting Better And Better

Back in early November, I reported that the first version of parallel sequential scan had been committed to PostgreSQL 9.6.  I'm pleased to report that a number of significant enhancements have been made since then.  Of those, the two that are by the far the most important are that we now support parallel joins and parallel aggregation - which means that the range of queries that can benefit from parallelism is now far broader than just sequential scans.

Read more »

David Rowley: Parallel Aggregate – Getting the most out of your CPUs


A small peek into the future of what should be arriving for PostgreSQL 9.6.

Today PostgreSQL took a big step ahead in the data warehouse world: we are now able to perform aggregation in parallel using multiple worker processes! This is great news for those of you who are running large aggregate queries over tens of millions or even billions of records, as the workload can now be divided up and shared between worker processes seamlessly.

We performed some tests on a 4 CPU 64 core server with 256GB of RAM using TPC-H @ 100 GB scale on query 1. This query performs some complex aggregation on just over 600 million records and produces 4 output rows.

The base time for this query without parallel aggregates (max_parallel_degree = 0) is 1375 seconds.  If we add a single worker (max_parallel_degree = 1) then the time comes down to 693 seconds, which is just 6 seconds off being twice as fast, so quite close to linear scaling. If we take the worker count up to 10 (max_parallel_degree = 10), then the time comes down to 131 seconds, which is once again just 6 seconds off perfect linear scaling!
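
For reference, the setup and query shape involved look roughly like this; the parameter name is the one used during the 9.6 development cycle, and the query is a simplified sketch of TPC-H Q1 rather than the exact text used in the benchmark:

-- allow up to 10 parallel workers for this session
set max_parallel_degree = 10;

-- simplified TPC-H Q1: aggregate ~600 million lineitem rows into a handful of groups
select l_returnflag,
       l_linestatus,
       sum(l_quantity)      as sum_qty,
       avg(l_extendedprice) as avg_price,
       count(*)             as count_order
from   lineitem
where  l_shipdate <= date '1998-12-01' - interval '90 days'
group  by l_returnflag, l_linestatus
order  by l_returnflag, l_linestatus;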

The chart below helps to paint a picture of this. The blue line is the time in seconds. Remember that the time axis is on a logarithmic scale; otherwise the performance increase is a little too steep to see the detail at higher worker counts.

You can see that even with 30 worker processes we’re still just 20% off of the linear scale. Here the query runs in 56 seconds, which is almost 25 times faster than the non-parallel run.

[Chart: parallel_aggregate — query time in seconds (log scale) versus number of parallel workers]

This is the first really effective use case of the recent parallel query infrastructure.

More work is still to be done to parallel-enable some aggregate functions, but today’s commit is a great step forward.

Hubert 'depesz' Lubaczewski: Waiting for 9.6 – Add simple VACUUM progress reporting.

On 15th of March, Robert Haas committed patch: Add simple VACUUM progress reporting.   There's a lot more that could be done here yet - in particular, this reports only very coarse-grained information about the index vacuuming phase - but even as it stands, the new pg_stat_progress_vacuum can tell you quite a bit about what […]

Hubert 'depesz' Lubaczewski: Waiting for 9.6 – Add idle_in_transaction_session_timeout.

On 16th of March, Robert Haas committed patch: Add idle_in_transaction_session_timeout.   Vik Fearing, reviewed by Stéphane Schildknecht and me, and revised slightly by me. This is something amazing. I hate “idle in transaction" connections. These cause lots of problems, and while are generally easy to solve, in times where we have orms, web frameworks, various […]

Devrim GÜNDÜZ: 9.6, or 10.0?

Even though PostgreSQL 9.5 was released not too long ago, we are already getting close to the feature freeze for 9.6.

When Windows support and PITR capabilities were added during the 7.5 cycle, the community decided to change the major version to 8.0, as they represented significant changes at the time. A similar thing happened in the 8.5 development cycle: we got in-core replication, and it was released as 9.0.

Now, as 9.6 is coming closer, I seriously think that we should release 10.0, not 9.6 (I've been ranting about this on my Twitter account for a while). Following the commit of parallel sequential scan, subsequent commits for:

- Parallel Joins
- Parallel Aggregation

increased parallelism features.

Another big infrastructure change is Tom's patch on "Making the upper part of the planner work by generating and comparing Paths."

Apart from these, following the commit that changes the VM file format, another commit titled "Don't vacuum all-frozen pages" will let us deploy PostgreSQL in some environments where we could not before, I think.

To keep it short, I think that these infrastructural changes should result in a .0 release, as we did before.

What do you think?

Rajeev Rastogi: Native Compilation Part-2 @ PGDay Asia 2016 Conference

The first PGDay Asia, held in Singapore, has finally finished successfully. I am one of the organizers of this conference. It took months of hard work by many people, especially Sameer Kumar, to achieve this.

I also had the opportunity to present my paper on Native Compilation part 2, a follow-up to the paper I presented at PGCon 2015 (procedure compilation is the new addition).

Summary of the talk (for the full slides, visit Native Compilation at PGDay Asia 2016):

Why Native Compilation Is Required:

Given current hardware trends, I/O is no longer the bottleneck for the executor, mainly for two reasons:
1. Increase in RAM size
2. Prevalence of high-speed SSDs
At the same time, although a lot of work has happened on the hardware side, CPU efficiency has not kept pace. Because of this, the current executor's biggest bottleneck is CPU usage efficiency, not I/O. So in order to tackle the CPU bottleneck without compromising on any feature, we need a mechanism that executes fewer instructions while still providing all the functionality.


What is Native Compilation:

Native compilation is a methodology to reduce CPU instructions by executing only the instructions specific to a given query/object, unlike interpreted execution. The steps are:
 1. Generate C code specific to the object/query.
 2. Compile the C code into a DLL and load it with the server executable.
 3. Call the specialized functions instead of the generalized ones.

Native compilation can be applied to any entity that is inefficient in terms of CPU, e.g.:
 1. Table
 2. Procedure
 3. Query

Table Native Compilation (a.k.a. Schema Binding):

Since most of the properties of a particular table remain the same once it is created, its data gets stored and accessed in the same pattern regardless of the data itself. So instead of accessing tuples of a relation in a generic way, we create specialized access functions for the particular relation during its creation, and these are used for subsequent queries on that table. This approach eliminates the need for various checks (e.g. data-type check, length check, number-of-attributes check) during tuple access. The more tuple deforming contributes to the total run time of a query, the more improvement will be observed. Code generation can be done in three ways; details of each are available in the presentation.


Fig-1: Table compilation


Once a CREATE TABLE command is issued, a C file with all the specialized access functions is generated, which in turn gets loaded as a DLL. These loaded functions are then used by every SQL query accessing the compiled table.

Performance Improvement:

Below is the performance improvement observed on the standard TPC-H benchmark.
The system configuration was as below:
SUSE Linux Enterprise Server 11 (x86_64), 2 cores, 10 sockets per core
TPC-H configuration: default

Graph-1: TPC-H Performance

Table-1: TPC-H Performance 



With the above results, we observed more than a 70% reduction in CPU instructions and up to a 36% performance improvement on TPC-H queries.

Procedure Compilation:

Any procedure has two kinds of parts:
 1. SQL query statements
 2. Non-SQL statements (e.g. loops, conditionals, simple expressions, etc.)

In this presentation, we focused only on the second part. Whenever a procedure is created, we compile the PL/pgSQL function into a corresponding normal C function. While compiling:
 1. Transform the PL/pgSQL procedure signature into a normal C function signature.
 2. Transform all variable declarations into C variable declarations.
 3. Transform all other non-SQL statements into corresponding C statements.
 4. Transform all SQL statements into SPI_xxx calls, as below:
/* prepare the plan once, keep it, and reuse it on later executions */
if (plan == NULL)
{
    stmt = SPI_prepare(modified_query, number_of_variables, types_of_variables_used);
    SPI_keepplan(stmt);
    plan = stmt;
}
SPI_execute_plan(plan, values, isnull, false, row_count);
 5. Finally, compile the generated C function using a traditional compiler to generate a DLL and link it with the server executable.
 6. On subsequent executions, the C function will be called directly instead of the PL/pgSQL function.


Fig-2: Procedure Compilation

Fig-3: Compiled Procedure invocation

Fig-2 highlights at what step, and how, a procedure is compiled. Once the parsing of the procedure is done, we have all the information about it in a PLpgSQL_function pointer, which we can traverse (as the planner does) and generate corresponding C code for each statement.
Fig-3 explains how the compiled function is invoked.

How the number of instructions is reduced:

Consider a statement such as x := 2 + y in a PL/pgSQL procedure. This statement is executed as a traditional SELECT command, "SELECT 2+y;". So every time the procedure runs, this query is executed as if it were a normal SQL query, and hence many CPU instructions are executed.
But if we convert it to a C-based statement, it is evaluated directly as 2+y and hence executes very few instructions compared to the original.
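
A tiny PL/pgSQL function of the kind being discussed might look like this (a hypothetical example, not code from the talk):

create or replace function add_two(y integer) returns integer as $$
declare
    x integer;
begin
    -- interpreted PL/pgSQL evaluates this assignment as "SELECT 2 + y";
    -- after native compilation it becomes a plain C expression
    x := 2 + y;
    return x;
end;
$$ language plpgsql;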

Performance Improvement:

Below is the performance improvement observed on the standard TPC-C benchmark.
The system configuration was as below:
SUSE Linux Enterprise Server 11 (x86_64), 2 cores, 10 sockets per core
TPC-C configuration: runMins=0, runTxnsPerTerminal=200000
checkpoint_segments = 100

Graph-2: TPC-C Performance

Table-2: TPC-C Performance

With some basic compilation of procedures, we are able to get around a 23% performance improvement.

Query Compilation:

Though this was not part of my presentation, I would like to give some details on how to do it.
Unlike the above two kinds of compilation, in order to compile queries we need to use LLVM, because compilation time matters a great deal while executing a query.
LLVM is an open source compilation framework for very efficient and fast compilation. The LLVM compiler can only understand IR (intermediate representation, something similar to assembly code). So before compiling with LLVM, we need to generate the IR form of the code, which can be done in either of two ways:
1. Generate C code and produce IR for it using Clang (a compiler front end for C), or
2. Use the IRBuilder interface provided by LLVM to generate code directly in IR form.
The steps to compile a query are as below:
1. Generate the query plan as usual.
2. Choose the plan node(s) to be compiled (if the whole query is to be compiled, then all nodes) and generate specific code corresponding to each node. This can be done in either of the ways explained above.
3. Compile the generated code using LLVM.
4. Compilation with LLVM returns an object handle, which can be attached to the corresponding node.
5. While executing a node, we call the compiled object if one is stored for that node.

Most of the benefit is expected from this query compilation. Once query compilation is done, even the SQL statements of a procedure can be compiled in the same way (perhaps nothing explicit will be required for this...).

Conclusion:

Following the industry trend, we have implemented two kinds of specialization, which resulted in up to 36% and 23% performance improvements on the standard TPC-H and TPC-C benchmarks respectively.


Any feedback/comments/queries are welcome.

US PostgreSQL Association: Why PgConf.US, 20 years of PostgreSQL -- That's why!


A conference specifically engineered to bring together community, developers, users, and the companies that support PostgreSQL, PgConf US is the conference for PostgreSQL in North America. When you attend you will be surrounded by the best and brightest that our community has to offer.

read more

Hubert 'depesz' Lubaczewski: Waiting for 9.6 – Directly modify foreign tables.

On 18th of March, Robert Haas committed patch: Directly modify foreign tables.   postgres_fdw can now sent an UPDATE or DELETE statement directly to the foreign server in simple cases, rather than sending a SELECT FOR UPDATE statement and then updating or deleting rows one-by-one.   Etsuro Fujita, reviewed by Rushabh Lathia, Shigeru Hanada, Kyotaro […]

Bruce Momjian: Oracle Attacks Postgres in Russia


During my twenty years with Postgres, I knew the day would come when proprietary databases could no longer ignore Postgres and would start attacking us.

Well, that day has come, at least in Russia. During the past few weeks, Oracle sent a letter (Russian, English translation) to Russian partners and customers comparing Oracle favorably to Postgres as a way of circumventing a new law favoring Russian-produced software. This is the first direct attack I have seen on Postgres, and it is probably representative of the kinds of attacks we will see from other vendors and in other countries in the years to come.

The press has picked up on the news (Russian, English) and given balanced coverage. Comments on the English article were, in general, positive — I particularly liked this one. There are two Hacker News threads about it (1, 2), a community thread about it, and another community thread about Oracle RAC.

Continue Reading »

Hubert 'depesz' Lubaczewski: Waiting for 9.6 – Support parallel aggregation.

On 21st of March, Robert Haas committed patch: Support parallel aggregation.   Parallel workers can now partially aggregate the data and pass the transition values back to the leader, which can combine the partial results to produce the final answer.   David Rowley, based on earlier work by Haribabu Kommi. Reviewed by Álvaro Herrera, Tomas […]

Ozgun Erdogan: Citus Unforks From PostgreSQL, Goes Open Source


Citus Unforks From PostgreSQL, Goes Open Source

When we started working on CitusDB 1.0 four years ago, we envisioned scaling out relational databases. We loved Postgres (and the elephant) and picked it as our underlying database of choice. Our goal was to extend this database to seamlessly shard and replicate your tables, provide high availability in the face of failures, and parallelize your SQL queries across a cluster of machines.

We wanted to make the PostgreSQL elephant magical.

Four years later, CitusDB has been deployed into production across a number of verticals, and received numerous feature improvements with every release. PostgreSQL also became much more extensible in that time–and we learned a lot more about it.

Today, we’re happy to release Citus 5.0, which seamlessly scales out PostgreSQL across a cluster of machines for real-time workloads. We’re also excited to share two major announcements in conjunction with the 5.0 release!

First, Citus 5.0 now fully uses the PostgreSQL extension APIs. In other words, Citus becomes the first distributed database in the world that doesn't fork the underlying database. This means Citus users can immediately benefit from new features in PostgreSQL, such as semi-structured data types (json, jsonb), UPSERT, or, when 9.6 arrives, no more full table vacuums. Also, users can keep working with their existing Postgres drivers and tools.

Second, Citus is going open source! The project, codebase, and all open issues are now available on Github. We realized that when we mentioned Citus and PostgreSQL together to prospective users, they already assumed that Citus was open. After many conversations with our customers, advisors, and board, we are happy to make Citus available for everyone.

To see how to get started with it, let’s take a hands-on look.

Getting Rolling

You can download the extension here. Once you’ve downloaded it you can bootstrap your initial Postgres database with the Citus cluster. From here we can begin using Citus:

CREATE EXTENSION citus;

Now that the extension is enabled you can begin taking advantage of it. First we’ll create a table, then we’ll tell Citus to create it as a distributed table, and finally we’ll inform it about our shards. If you’re running Citus on a single machine, this will scale queries across multiple CPU cores and create the impression of sharding across databases.

As an example, which you can find more detail on in our tutorial, we’re going to create a table to capture edits from wikipedia, then shard this table across multiple Postgres instances. First let’s create our table:

CREATE TABLE wikipedia_changes (
  editor TEXT, -- The editor who made the change
  time TIMESTAMP WITH TIME ZONE, -- When the edit was made
  bot BOOLEAN, -- Whether the editor is a bot

  wiki TEXT, --  Which wiki was edited
  namespace TEXT, -- Which namespace the page is a part of
  title TEXT, -- The name of the page

  comment TEXT, -- The message they described the change with
  minor BOOLEAN, -- Whether this was a minor edit (self-reported)
  type TEXT, -- "new" if this created the page, "edit" otherwise

  old_length INT, -- How long the page used to be
  new_length INT -- How long the page is as of this edit
);

Now that we’ve created our table, we’re going to tell the Citus extension that this is the one we want to shard. In the case of our demo, we’re going to lower the replication factor to one, since we’re only running one worker node:

SET citus.shard_replication_factor = 1;
SELECT master_create_distributed_table( 'wikipedia_changes', 'editor', 'hash' );
SELECT master_create_worker_shards('wikipedia_changes', 16, 1);

You can start inserting data with a standard INSERT INTO, and Citus will shard and distribute your data across multiple nodes. If you want a jump start at loading data, check out our tutorial, which has scripts to help you start loading data automatically from the wikipedia event stream.
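
For illustration, a hand-written row and a simple aggregate against the distributed table could look like this (the values are made up; the tutorial's scripts load real rows from the event stream):

-- a made-up edit; the row is routed to the shard owning this editor's hash value
INSERT INTO wikipedia_changes
    (editor, time, bot, wiki, namespace, title,
     comment, minor, type, old_length, new_length)
VALUES
    ('ExampleEditor', now(), false, 'enwiki', 'Main', 'PostgreSQL',
     'fixed a typo', true, 'edit', 1200, 1210);

-- a simple aggregate over all shards
SELECT editor, count(*) AS edits
FROM wikipedia_changes
GROUP BY editor
ORDER BY edits DESC
LIMIT 10;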

It’s that simple to use the fully open source Citus 5.0 extension. Now, let’s take a deeper look at some of the technical details.

What’s Unique about Citus?

Citus uses three new ideas when building the distributed database.

  1. Citus scales out SQL by extending PostgreSQL, not forking it. This way, users benefit from all the performance and feature work done on Postgres over the past two decades, scaled out on a cluster of machines.
  2. Data-intensive applications have evolved over time to require multiple workloads from the database. Citus comes with three distributed executors, recognizing differences across operational (low-latency) and analytic (high-throughput) workloads.
  3. Parallelizing SQL queries requires that the underlying theoretical framework is complete. Citus’ distributed query planner uses multi-relational algebra, which is proven to be complete.

These principles help us lay the foundation for a scalable relational database. With that said, we know that we still have more work ahead of us. PostgreSQL is huge, and Citus currently doesn’t support the full spectrum of SQL queries. For details on SQL coverage, please see our FAQ.

A good way to get started with Citus today is to think of it in terms of your use-case.

Common Use Cases 

Citus provides users real-time responsiveness over large datasets, most commonly seen in rapidly growing event systems or with time series data. Common uses include powering real-time analytic dashboards, exploratory queries on events as they happen, session analytics, and large data set archival and reporting.

Citus is deployed in production across multiple verticals, ranging from technology start-ups to large enterprises. Here are some examples:

  • CloudFlare uses Citus to provide real-time analytics on 100 TBs of data from over 4 million customer websites.
  • Neustar builds and maintains a scalable ad-tech infrastructure that analyzes billions of events per day using HyperLogLog and Citus.
  • Agari uses Citus to secure more than 85 percent of U.S. consumer emails on two 6-8 TB clusters.
  • Heap uses Citus to run dynamic funnel, segmentation, and cohort queries across billions of users and tens of billions of events.

As excited as we are to make Citus 5.0 available to everyone, we’d be remiss to not pay attention to those of you who need something more. For customers with large production deployments, we also offer an enterprise edition that comes with additional functionality and commercial support.

In Conclusion

We’re excited to release the latest version of Citus and make it open source. And we’d love to hear your feedback. If you have questions or comments for us, start a thread in our Google Group, join us through the Citus IRC channel, or open an issue on Github.

Gianni Ciolli: PostgreSQL User Group NL


Last week I was invited by the Dutch PostgreSQL User Group in Amsterdam to speak on PostgreSQL Administration Recipes. It was the first session of 2016, and the third since they started meeting last year.

PostgreSQL Administration Recipes

Using the simple format of a cooking recipe, I presented some techniques that a PostgreSQL DBA can use to solve recurring problems, for instance: change your password without leaving traces of the new password around; quickly estimate the number of rows in a table; see which parameters have a non-default setting; temporarily disable an index without dropping it. The last recipe was more philosophical: plan your backups, or better yet, plan your recovery!
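
To give a flavour, two of those recipes roughly boil down to catalog queries like the following (sketches, not necessarily the exact recipes from the slides; my_table is a placeholder):

-- which parameters have a non-default setting?
SELECT name, setting, source
FROM pg_settings
WHERE source NOT IN ('default', 'override');

-- quickly estimate the number of rows in a table from planner statistics
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE oid = 'my_table'::regclass;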

After my talk I listened to the interesting presentation from Reiner Peterke who described pg_inside, his tool for collecting PostgreSQL statistics.

Reiner Peterke presenting pg_inside

I was pleased to find a lively audience who asked some interesting questions. I was also asked to publish my slides, which are now available here.

I shall close with a historic note: as far as I know, this is the first community event organised in the Netherlands since the 2011 European PostgreSQL conference in Amsterdam. I was one of the speakers back then and I have fond memories; most of all, my lightning talk on Barman anticipating the ongoing work of my colleagues. That was indeed the very first public mention of an outstanding Disaster Recovery tool, which is now extremely popular.

I also recall meeting several local PostgreSQL users at that conference, and this is why I am particularly glad to join the effort of growing the Dutch PostgreSQL community. My thanks go to the organisers (Coen Hamers and Jolijn Bos from Splendid Data) and to IBM who hosted the event in their space inside B.Amsterdam, for creating a great opportunity to discuss The World’s Most Advanced Open Source Database.

PostgreSQL User Group NL


Kevin Grittner: Home on the BRIN Range in PostgreSQL 9.5

There is a new index type in PostgreSQL version 9.5 called BRIN, for "Block Range Index". BRIN is an exciting feature for the 9.5 release because it can quickly organize very large tables for analysis and reporting, moving PostgreSQL deeper into the big data space.

BRIN is useful where indexed data naturally tends to be grouped or ordered in the table's heap.  This can happen naturally, for example, when a column holds a sequentially assigned number or a date.  In tables created for reporting or analysis, ordering can be arranged by loading in a sorted order or using the CLUSTER command.

BRIN indexes can be built quickly, are very compact, and can deliver much better query performance than unindexed data.  It’s important to note, however, that a BTREE index (if you can tolerate the cost of building and maintaining it) usually gives faster lookup speeds.

While the BRIN index access method is extensible enough that it has multiple uses (e.g., a bloom filter), this post will focus on simple ranges and containment for core data types.  A range can be used if the type has a linear order: dates, numbers, text strings, etc.  Containment can be used for such things as geometric shapes or IP/CIDR data.  For applicable core data types BRIN support is provided "out of the box" -- you can simply create a BRIN index without any extra steps.

The basic concept is that each index entry, rather than specifying a specific value and pointing to a single data row, specifies a range of values and points to a range of heap pages.  The number of pages defaults to 128 (1MB using the default block size), although you can override that during index creation.  The default seems to be pretty effective for the test cases I've tried; I expected to see a slight performance boost by setting it to one for some tests, but found it to be a little slower -- probably because of the additional index entries that needed to be processed.  Smaller ranges will also tend to take more space and generate more maintenance overhead if the table is modified.
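
For instance, the range size can be chosen at index creation time with the pages_per_range storage parameter; using the table from the example further down, that would look like:

create index t2_d_brin32 on t2 using brin (d) with (pages_per_range = 32);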

For any table with a BRIN index, if rows are added a VACUUM becomes especially important, since new BRIN index entries are only created during execution of the CREATE INDEX command, vacuuming, or execution of the new brin_summarize_new_values() function.  Any rows in new page ranges must be scanned for every BRIN index scan until one of these operations creates new entries for those new page ranges.  New or updated rows in indexed ranges will expand range information for already-existing BRIN index entries as needed.

Also note that if ranges "contract", the index entries are not dynamically maintained; the REINDEX command can be used to improve the efficiency of an index which has had a lot of such churn.
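
Using the index from the example below, those two maintenance operations look like this:

-- summarize page ranges added since the index was last summarized,
-- without waiting for a vacuum
select brin_summarize_new_values('t2_d_brin');

-- rebuild the index if many ranges have become much wider than the data they cover
reindex index t2_d_brin;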

Here is a simple example of how a BRIN index can combine with a BTREE index for reporting.
create table t2 (d date not null, i int not null);
insert into t2
(
select y.d, floor(random() * 10000)
from (select generate_series(timestamp '2001-01-01',
timestamp '2020-12-31',
interval '1 day')::date) y(d)
cross join (select generate_series(1, 1000)) x(n)
);
create index t2_d_brin on t2 using brin (d);
create index t2_i_btree on t2 (i);
vacuum analyze;
explain select count(*) from t2
where d between date '2016-05-01' and date '2016-05-15'
and i between 300 and 350;

The plan looks like this:
 Aggregate  (cost=1240.18..1240.19 rows=1 width=0)
   ->  Bitmap Heap Scan on t2  (cost=941.72..1239.99 rows=77 width=0)
         Recheck Cond: ((d >= '2016-05-01'::date) AND (d <= '2016-05-15'::date) AND (i >= 300) AND (i <= 350))
         ->  BitmapAnd  (cost=941.72..941.72 rows=77 width=0)
               ->  Bitmap Index Scan on t2_d_brin  (cost=0.00..163.94 rows=15194 width=0)
                     Index Cond: ((d >= '2016-05-01'::date) AND (d <= '2016-05-15'::date))
               ->  Bitmap Index Scan on t2_i_btree  (cost=0.00..777.49 rows=36906 width=0)
                     Index Cond: ((i >= 300) AND (i <= 350))

Note that a bitmap from a BRIN index and a bitmap from a BTREE index can be combined to limit which pages are read from the heap.

Here is a comparison of the time needed to build the index on the "i" column, the space required, and the query time:


                       No index    BRIN index     BTREE index
Time to index          0 seconds   1.95 seconds   11.56 seconds
Index size in pages    0           3              20,033
Query execution time   55 ms       35 ms          13 ms

On a much larger table the benefits would be more dramatic; this example is kept small for the convenience of those who want to play with variations.


BRIN Summary
  • Advantages
    • Faster to create and maintain than BTREE indexes
    • Much smaller than BTREE indexes
    • Where data is naturally grouped, they may be near the speed of BTREE (for example, date added or a sequence number)
    • CLUSTER may allow some uses where data doesn't naturally fall into the heap in appropriate groupings
    • In some cases, will yield many of the benefits of partitioning with less setup effort
  • Limitations
    • Not useful if the values are randomly spread across the table
    • Lookups not as fast as other index types


Armed with these basic techniques for creating BRIN indexes, you can quickly generate compact indexes on very large tables to speed analysis and reporting for big data.  The ease and speed of creation make it practical to just try a few new indexes to see whether they help query performance, rather than spending a lot of time analyzing what might help.

US PostgreSQL Association: PgConf.US and the community pavilion!


PostgreSQL loves community and nowhere is that more obvious than PgConf US. This year PgConf US has instituted a fan favorite, the Community Pavilion. When you combine the most advanced database in the world with some of the best open source technology in the world what do you find?

A powerhouse of interest that brings forth every corner of the technological spectrum to collaborate.

Who are a few of these technological marvels attending the largest and most popular PostgreSQL conference in the United States?

  • Python via BigApplePy
  • Node.js

read more

Andrew Dunstan: Weird stuff happens

Five days ago, my buildfarm animal jacana suddenly started getting an error while trying to extract typedefs. It had been happily doing this for ages, and suddenly this quite reproducible error started. For now I have disabled its typedef analysis, but I will need to get to the bottom of it. It's bad enough that it crashes the buildfarm client, leaving the build directory dirty and unable to process further builds until I clean it up. I'm assuming it's probably something that changed in the source code, as nothing else has changed at all. These are the commits that took place between a good run and the first appearance of the error.

  • 9a83564 Allow SSL server key file to have group read access if owned by root
  • 6eb2be1 Fix stupid omission in c4901a1e.
  • 07aed46 Fix missed update in _readForeignScan().
  • ff0a7e6 Use yylex_init not yylex_init_extra().
  • a3e39f8 Suppress FLEX_NO_BACKUP check for psqlscanslash.l.
  • 0ea9efb Split psql's lexer into two separate .l files for SQL and backslash cases.

I don't know for dead certain that any of these has caused an issue, but finding out what the problem is is just one more way for me to spend my copious free time.

solaimurugan vellaipandian: Analyze Data to Identify type of Readmissions using PostgreSQL

Analyze Data to Identify Causes of Readmissions using PostgreSQL

What is patient readmission

A readmission normally means a patient returning within 30 days of a previous discharge.

Reasons for readmission

Multiple factors contribute to avoidable hospital readmissions: they may result from poor quality care or from poor transitions between different providers and care settings. The problem of readmission to the hospital is receiving increased attention as a potential way to address problems in quality of care, cost of care, and care transitions.

Readmission table structure

After cleansing the data and identifying the attributes required to derive readmission details, my entity model looks like this:
patient_id | disease    | adm_date        | dis_date
-----------+------------+-----------------+-----------------
1069       | Oncology   | 6/17/2013 6:51  | 6/21/2013 7:15
1078       | Neuro      | 7/18/2013 12:10 | 7/20/2013 08:12
1082       | Ortho      | 7/19/2013 12:10 | 7/22/2013 08:12
1085       | Cardiothor | 8/25/2013 12:10 | 8/27/2013 08:12
1085       | Cardiothor | 9/13/2013 12:10 | 9/16/2013 08:12
Now we have to write a query to get the list of readmissions. The condition is: the current admission date falls between the last discharge date and 30 days after it.

Readmission query in PostgreSQL
Select count(*) "Readmission Count", disease
From readmission_view v
Where Exists (Select *
              From readmission_view
              Where patient_id = v.patient_id
                and v.adm_date between dis_date and dis_date + 30)
Group by disease
Order by count(*)
Readmission result

disease  | Readmission Count
---------+------------------
Ortho    | 878
Neuro    | 567
Oncology | 155
Using the result obtained from the query, we can find the percentage of patient readmissions per disease; diseases with high readmission rates need extra care.
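
One possible way to turn those counts into per-disease percentages is sketched below; it assumes the same readmission_view, and the FILTER clause needs PostgreSQL 9.4 or later:

WITH flagged AS (
    SELECT v.disease,
           EXISTS (SELECT 1
                   FROM readmission_view r
                   WHERE r.patient_id = v.patient_id
                     AND v.adm_date BETWEEN r.dis_date AND r.dis_date + 30) AS is_readmission
    FROM readmission_view v
)
SELECT disease,
       count(*) AS total_admissions,
       count(*) FILTER (WHERE is_readmission) AS readmissions,
       round(count(*) FILTER (WHERE is_readmission) * 100.0 / count(*), 1) AS readmission_pct
FROM flagged
GROUP BY disease
ORDER BY readmission_pct DESC;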

Alexander Korotkov: Monitoring Wait Events in PostgreSQL 9.6


Recently Robert Haas committed a patch which allows seeing more detailed information about the current wait event of a process. In particular, users are able to see if a process is waiting for a heavyweight lock, a lightweight lock (either individual or tranche) or a buffer pin. The full list of wait events is available in the documentation. Hopefully, there will be more wait events in future releases.

It’s nice to see the current wait event of a process, but a single snapshot is not very descriptive and definitely not enough to draw any conclusions. However, we can use sampling to collect suitable statistics. This is why I’d like to present pg_wait_sampling, which automates gathering sampled statistics of wait events. pg_wait_sampling enables you to gather statistics for graphs like the one below.

Let me explain how I drew this graph. pg_wait_sampling samples wait events into two destinations: history and profile. History is an in-memory ring buffer and profile is an in-memory hash table with accumulated statistics. We’re going to use the second one to see the intensity of wait events over time periods.

First, let’s create a table for the accumulated statistics. I’m doing these experiments on my laptop, and for simplicity this table will live in the instance under monitoring. But note that such a table could live on another server; I’d even say it’s preferable to place such data on another server.

CREATE TABLE profile_log (
    ts          timestamp,
    event_type  text,
    event       text,
    count       int8
);

Secondly, I wrote a function to copy data from the pg_wait_sampling_profile view into the profile_log table and reset the profile data. The function returns the number of rows inserted into profile_log. It also discards the pid number and groups the data by wait event, although it doesn’t necessarily have to.

CREATE OR REPLACE FUNCTION write_profile_log() RETURNS integer AS $$
DECLARE
    result integer;
BEGIN
    INSERT INTO profile_log
        SELECT current_timestamp, event_type, event, SUM(count)
        FROM pg_wait_sampling_profile
        WHERE event IS NOT NULL
        GROUP BY event_type, event;
    GET DIAGNOSTICS result = ROW_COUNT;
    PERFORM pg_wait_sampling_reset_profile();
    RETURN result;
END
$$ LANGUAGE plpgsql;

Then I ran a psql session where I set up a watch on this function. Monitoring of our system has started. For real usage it’s better to schedule this command using cron or something similar.

smagen@postgres=# SELECT write_profile_log();
 write_profile_log
-------------------
                 0
(1 row)

smagen@postgres=# \watch 10
Fri Mar 25 14:03:09 2016 (every 10s)

 write_profile_log
-------------------
                 0
(1 row)

We can see that write_profile_log returns 0, which means we didn’t insert anything into profile_log. And this is correct, because the system is not under load yet. Let’s create some load using pgbench.

$ pgbench -i -s 10 postgres
$ pgbench -j 10 -c 10 -M prepared -T 60 postgres
    

In a parallel session we can see that write_profile_log starts to insert some data into the profile_log table.

Fri Mar 25 14:04:19 2016 (every 10s)

 write_profile_log
-------------------
                 9
(1 row)

Finally, let’s examine the profile_log table.

SELECT * FROM profile_log;
             ts             |  event_type   |       event       | count
----------------------------+---------------+-------------------+-------
 2016-03-25 14:03:19.286394 | Lock          | tuple             |    41
 2016-03-25 14:03:19.286394 | LWLockTranche | lock_manager      |     1
 2016-03-25 14:03:19.286394 | LWLockTranche | buffer_content    |    68
 2016-03-25 14:03:19.286394 | LWLockTranche | wal_insert        |     3
 2016-03-25 14:03:19.286394 | LWLockNamed   | WALWriteLock      |    68
 2016-03-25 14:03:19.286394 | Lock          | transactionid     |   331
 2016-03-25 14:03:19.286394 | LWLockNamed   | ProcArrayLock     |     8
 2016-03-25 14:03:19.286394 | LWLockNamed   | WALBufMappingLock |     5
 2016-03-25 14:03:19.286394 | LWLockNamed   | CLogControlLock   |     1
 ...                        | ...           | ...               |   ...

How do we interpret this data? The first row says that the count for the tuple lock at 14:03:19 is 41. The pg_wait_sampling collector samples wait events every 10 ms, while the write_profile_log function writes a snapshot of the profile every 10 s. Thus there were 1000 samples per backend during this period. Taking into account that 10 backends were serving pgbench, we can read the first row as “from 14:03:09 to 14:03:19 the backends spent about 0.41% of their time waiting for a tuple lock”.
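
The same arithmetic can be done in SQL; here is a rough sketch that hard-codes the 10,000 samples per 10-second snapshot specific to this setup (10 ms sampling across 10 backends):

SELECT ts, event_type, event,
       round(count * 100.0 / 10000, 2) AS pct_of_time
FROM profile_log
ORDER BY ts, pct_of_time DESC;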

That’s it. This blog post shows how you can set up wait event monitoring of your database using the pg_wait_sampling extension with PostgreSQL 9.6. This example is just an introduction and is simplified in many ways, but experienced DBAs can easily adapt it to their setups.

P.S. Every kind of monitoring has some overhead. The overhead of wait monitoring was the subject of hot debates on the mailing lists; this is why features like exposing wait event parameters and measuring each wait event individually are not yet in 9.6. But sampling also has overhead. I hope pg_wait_sampling will be a starting point for showing, by comparison, that other approaches are not that bad, so that we finally get something much more advanced in 9.7.
