
Josh Williams: Switching PostgreSQL WAL-based Backup Options


Sunbury hoard
Photo by Paul Hudson · CC BY 2.0, modified

I was woken up this morning. It happens every morning, true, but not usually by a phone call asking for help with a PostgreSQL database server that was running out of disk space.

It turns out that one of the scripts we’re in the process of retiring, but still had in place, got stuck in a loop and filled most of the available space with partial, incomplete base backups. So, since I’m awake, I might as well talk about Postgres backup options. I don’t mean for it to be a gripe session, but I’m tired and it kind of is.

For this particular app, since it resides partially on AWS, we looked specifically at options that can work natively with S3. We’ve currently settled on pgBackRest. There’s a bunch of options out there, which doesn’t make the choice easy. But I suppose that’s the nature of things these days.

At first we’d tried out pghoard. It looks pretty good on the tin, especially with its ability to connect to multiple cloud storage services beyond S3: Azure, Google, Swift, etc. Having options is always nice. And for the most part it works well, apart from a couple of idiosyncrasies.

We had the most trouble with the encryption feature. It didn’t have any problem on the encryption side. But for some reason on the restore the process would hang and eventually fail out without unpacking any data. Having a backup solution is a pretty important thing, but it doesn’t mean anything unless we can get the data back from it. So this was a bit of a sticking point. We probably could have figured out how to get it functioning, and at least been a good citizen and reported it upstream to get it resolved in the source. But we kind of just needed it working, and giving something else a shot is a quicker path to that goal. Sorry, pghoard devs.

The other idiosyncratic behavior that’s probably worth mentioning is that it does its own scheduling. The base backups, for instance, happen at a fixed hour interval set in the configuration file, starting from when the service is first spun up. So if I set it to 24 hours, and then service pghoard start at 1 PM, well, that’s the schedule now. So much for having it run overnight. Nor can I tell it to “create a base backup right now” at other times. And, so far as I can find, there’s no way to override that.

Also, rather than using an archive_command, it connects as a streaming replica and pulls WAL files that way, which is great from the perspective of not being invasive. But during a load test we generated activity faster than it could write to S3, and hit the classic streaming replica problem of needing WAL files that had already been removed from the master. It doesn’t handle this condition, though, and just retries and retries after the error, even after another base backup has happened. I don’t know whether it would eventually have given up once the base backup that needed those WAL files had been pushed out by the retention policies; I eventually killed it and had it make a fresh start. There’s some chatter out there about having it use replication slots, which would be the natural solution for this case, so maybe that’s fixed by now.

So, we switched to something else, and selected pgBackRest next. It’d since been used in another project or two, so it had some bonus points for recently acquired institutional knowledge. It supports storage in S3 (and only S3, but whatevs). And this time we didn’t have to build and distribute debs; it’s already available in the Postgres apt repo.

I like things neat and tidy. A base backup I’ve always seen as essentially one unit: it’s the database cluster as a whole (even though I know it’s in a potentially inconsistent state, being spread over an interval of time), stored and delivered as a compressed tar file or some such. pgBackRest takes a different view, and instead stores its base backup as separate relation files, essentially as they are on disk, each individually compressed. Which, given it’s something I don’t have to touch and move around manually, is a format that’s growing on me.

Because it does this, pgBackRest brings back a concept I haven’t thought about in a long time: “incremental” or “differential” backups. If a relation (or part of a really big relation, since it’s chopped up into 1 GB sections) doesn’t change between base backups, pgBackRest can skip those and point back to an earlier copy of the file. In most databases (at least I’d argue) this won’t matter so much, as between user activity, background tasks like vacuum, and such, things are frequently changing in those relation files. But in this case, the client has a large table of file blobs that mostly gets appended to. So the data has been loaded, we ran a VACUUM FREEZE on it, and now the daily differential backups avoid a little bit of S3 cost.

Between the entries we have to add to cron and the edit of the archive_command parameter to save WAL, pgBackRest takes a couple more steps to get fully in place. But it’s all in configuration management now, so, meh. Besides, the cron-based schedule does make it possible to run at a specific time. And the WAL archival is guaranteed not to fall behind and get stuck.
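For a sense of what that looks like, here is a minimal sketch assuming a recent pgBackRest 2.x; the stanza name, paths, bucket, retention, and schedule below are hypothetical and should be adapted from the pgBackRest documentation:

# /etc/pgbackrest/pgbackrest.conf (hypothetical stanza "main")
[global]
repo1-type=s3
repo1-s3-bucket=example-backup-bucket
repo1-s3-endpoint=s3.amazonaws.com
repo1-s3-region=us-east-1
repo1-path=/pgbackrest
repo1-retention-full=2
repo1-retention-diff=6
repo1-cipher-type=aes-256-cbc
# repo1-cipher-pass and repo1-s3-key / repo1-s3-key-secret go here too

[main]
pg1-path=/var/lib/postgresql/11/main

# postgresql.conf
archive_command = 'pgbackrest --stanza=main archive-push %p'

# crontab for the postgres user: weekly full backups, daily differentials
30 3 * * 0    pgbackrest --stanza=main --type=full backup
30 3 * * 1-6  pgbackrest --stanza=main --type=diff backup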

Like pghoard, pgBackRest does encryption. pghoard’s is based on public/​private keys, which I might have preferred because the base backups are used to seed development systems, and only distributing the public key to those would be neat. pgBackRest’s is symmetric with a passphrase. Both also do automatic retention maintenance; we have it maintain two weekly full base backups and daily differentials in between. WAL retention is similarly automatic.

The cool thing was we could run both side by side for a while, and confirm that pgBackRest would work out before disabling pghoard. At least I thought that was cool, until that phone call came through. I kind of wish I knew what got it there in the first place; we found pghoard stuck in a base backup attempt loop that had mostly filled the disk, so we killed it and manually cleaned up its storage to get monitoring back to green.

So I suppose the morals of the story are:

  • Keep your options open. There are lots of projects out there that do similar things a little differently. Try them out, and don’t be afraid to pivot if something else is a little more to your liking.
  • The things you do try, test out fully. Restore the backups. Let them run for a while to see how they fare long-term and make sure all the assumptions hold up.
  • I need a second cup of coffee. BRB.

Shaun M. Thomas: PG Phriday: PgBouncer or Bust


What is the role of PgBouncer in a Postgres High Availability stack? What even is PgBouncer at the end of the day? Is it a glorified traffic cop, or an integral component critical to the long-term survival of a Postgres deployment?

When we talk about Postgres High Availability, a lot of terms might spring to mind. Replicas, streaming, disaster recovery, fail-over, automation; it’s a ceaseless litany of architectural concepts and methodologies. The real question is: how do we get from Here to There?

The Importance of Proxies

It’s no secret that the application stack must communicate with the database. Regardless of how many layers of decoupling, queuing, and atomicity our implementation has, data must eventually be stored for reference. But where is that endpoint? Presuming that write target is Postgres, what ensures the data reaches that desired terminus?

Consider this diagram:

Diagram: two nodes with a managed proxy layer

In this case, it doesn’t matter what type of Standby we’re using. It could be a physical streaming replica, some kind of logical copy, or a fully configured BDR node. Likewise, the Failover Mechanism is equally irrelevant. Whether we rely on repmgr, Patroni, Stolon, Pacemaker, or a haphazard collection of ad-hoc scripts, the important part is that we separate the application from the database through some kind of proxy.

Patroni relies on HAProxy and Stolon has its own proxy implementation, but what about the others? Traditionally PgBouncer fills this role. Without this important component, applications must connect directly to either the Primary or post-promotion Standby. If we’re being brutally honest, the application layer can’t be trusted with that kind of responsibility.

But why is that? Simply stated, we don’t know what the application layer is. In reality, an application is anything capable of connecting to the database. That could be the official software, or it could be a report, or a maintenance script, or a query tool, or any number of other access vectors. Which database node are they connecting to, and does it matter?

The proxy layer is one of the rare points we as DBAs can actually control, as it is the city to which all roads must lead.

VIP, DNS, and Load Balancers

Rather than implementing an additional tier of software, it’s often easier to rely on the old tried-and-true network infrastructure. A virtual IP address, for example, requires no extra resources beyond an IP address carved out of a likely liberally allocated internal VPN. DNS is likewise relatively seamless, having command-line manipulation available through utilities like nsupdate.

VIPs unfortunately have a major drawback that may mean they’re inapplicable for failover to a different DC: the assigned ethernet device must be on the same subnet. So if the address is 10.2.5.18, all nodes that wish to use it should be on the 10.2.5.* network. It’s fairly common to have dedicated subnets per Data Center, meaning they can’t share a single VIP. One possible solution to this is to create a subnet that spans both locations, specifically for sharing IP resources.

Another is to use DNS instead. However, this approach may be even worse in the long run. Because name lookups are relatively slow, various levels of caching are literally built into the protocol, and liberally enforced. These caches may be applied at the switch, the drivers, the operating system, a local daemon, and even the application itself. Each one has an independent copy of the cached value, and any changes to a DNS record are only truly propagated when all of these reflect the modification. As a result, the TTL of a DNS record can be a mere fraction of the time it actually takes for all layers to recognize the new target.

During all of this, it would be unsafe to continue utilizing the database layer for risk of split-brain, so applications must be suspended. Clearly that’s undesirable in most circumstances.

Some companies prefer load balancing hardware. This is a panacea of sorts, since such devices act like a VIP without the subnet restriction. Further, these often have programmable interfaces that allow scripts or other software to reconfigure traffic endpoints. This unfortunately relies on extra budgetary constraints that don’t apply to VIP or DNS solutions, making it a resource that isn’t always available.

Software like PgBouncer acts like a virtual approximation of such hardware, with the additional bonus of understanding the Postgres communication protocol. So long as there’s spare hardware, or even a minimally equipped VM, it’s possible to provide decoupled access to Postgres.

Smooth Transitions

One aspect the network-oriented PgBouncer alternatives ignore is comprehension of the Postgres communication protocol. This is critically important from a High Availability perspective, because it avoids immediately terminating ongoing transactions during manual switches. As a proxy, PgBouncer can react to transaction state, and consequently avoid interrupting active sessions.

Specifically, version 1.9 of PgBouncer introduced two new features that make this possible where it wasn’t before. It’s now possible to put a server backend into close_needed state. Normally PgBouncer is configured to be either in session mode, where server backends are assigned directly to client connections until they disconnect, or transaction mode, where backends are assigned to new clients after each transaction commit.

In close_needed state, a client that ends its session while in session mode will also close the server backend. Likewise in transaction mode, the server backend is closed and replaced with a new allocation at the end of the current transaction. Essentially we’re now allowed to mark a server backend as stale and in need of replacement at the earliest opportunity without preventing new connections.

Any configuration modification to PgBouncer that affects connection strings will automatically place the affected servers in close_needed state. It’s also possible to manually set close_needed by connecting to the pgbouncer pseudo-database and issuing a RECONNECT command. The implication here is that PgBouncer can be simultaneously connected to server A and server B without forcing a hard cutover. This allows the application to transition at its leisure if possible.

The secret sauce, however, is the server_fast_close configuration parameter. When enabled, PgBouncer will end server backends in close_needed state, even in session mode, provided the current transaction ends. Ultimately this means any in-progress transactions can at least ROLLBACK or COMMIT before their untimely demise. It also means we can redirect database server traffic without interrupting current activity, and without waiting for the transactions themselves to complete.

Previously, without these new features, we could only issue PAUSE and RELOAD, and then either wait for all connections to finally end of their own accord, or end them ourselves. Afterward, we could issue RESUME so traffic could reach the new database server target. Now the redirection is immediate, and any lingering transactions can complete as necessary.
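As a rough sketch of that newer flow (the database alias and host names here are hypothetical, and the connecting user must be listed in admin_users):

; pgbouncer.ini -- repoint the alias at the promoted server, with fast close enabled
[databases]
appdb = host=new-primary.example.com port=5432 dbname=appdb

[pgbouncer]
server_fast_close = 1

Then, from the pgbouncer admin pseudo-database:

$ psql -p 6432 -U pgbouncer pgbouncer
RELOAD;      -- pick up the edited connection string; affected backends become close_needed
RECONNECT;   -- or mark server backends for replacement explicitly

Existing sessions then migrate to the new target as their transactions finish, rather than being cut off mid-flight.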

This is the power of directly implementing the Postgres client protocol, and it’s something no generic proxy can deliver.

Always Abstract

At this point, it should be fairly evident what options are available. However, we also strongly recommend implementing an abstraction layer of some kind at all times, even when there is only a single database node. Here’s a short list of reasons why:

  • One node now doesn’t mean one node forever. As needs evolve and the architecture matures, it may be necessary to implement a full High Availability or Disaster Recovery stack.
  • Upgrades don’t care about your uptime requirements. Cross-version software upgrades and hardware replacement upgrades are principally problematic. Don’t get stuck playing Server Musical Chairs.
  • Connection pools can be surprisingly necessary. In a world where micro-architectures rule, sending thousands of heterogeneous connections to a single database server could otherwise lead to disaster.

Some of these are possible with VIPs and similar technology, others are limited to the realm of pooling proxies. The important part is that the abstraction layer itself is present. Such a layer can be replaced as requirements evolve, but direct database connections require some kind of transition phase.

Currently only PgBouncer can act as a drop-in replacement for a direct Postgres connection. Software like HAProxy has a TCP mode that essentially masquerades traffic to Postgres, but it suffers from the problem of unceremonious connection disruption on transition. That in itself isn’t necessarily a roadblock, so long as that limitation is considered during architectural planning.

In the end, some may prefer their Postgres abstraction layer to understand Postgres. Visibility into the communication protocol gives PgBouncer the ability to interact with it. Future versions of PgBouncer could, for example, route traffic based on intent, or stack connection target alternates much like Oracle’s TNS Listener.

Yes, it’s Yet Another VM to maintain. On the other hand, it’s less coordination with all of the other internal and external teams to implement and maintain. We’ve merely placed our database black-box into a slightly larger black-box so we can swap out the contents as necessary. Isn’t that what APIs are for, after all?


Abdul Yadi: pgqr: a QR Code Generator


Related to my post https://abdulyadi.wordpress.com/2015/11/14/extension-for-qr-code-bitmap/: I have repackaged the module and made it available on GitHub: https://github.com/AbdulYadi/pgqr.

This project adds two pieces of functionality to the QR code generator from the repository https://github.com/swex/QR-Image-embedded:

  1. In-memory monochrome bitmap construction (1 bit per pixel).
  2. Wrap the whole package as PostgreSQL extension.

This project has been compiled successfully on Linux against PostgreSQL version 11.
$ make clean
$ make
$ make install

On successful compilation, install this extension in the PostgreSQL environment:
CREATE EXTENSION pgqr;

Function pgqr has 4 parameters:

  1. t text: text to be encoded.
  2. correction_level integer: 0 to 3.
  3. model_number integer: 0 to 2.
  4. scale integer: pixels for each dot.

Let us create a QR Code:
SELECT pgqr('QR Code with PostgreSQL', 0, 0, 4);
The output is a monochrome bitmap ready for display.

 

 

Claire Giordano: 10 Most Popular Citus Data Blog Posts in 2018, ft. Postgres


Seasons each have a different feel, a different rhythm. Temperature, weather, sunlight, and traditions—they all vary by season. For me, summer usually includes a beach vacation. And winter brings the smell of hot apple cider on the stove, days in the mountains hoping for the next good snowstorm—and New Year’s resolutions. Somehow January is the time to pause and reflect on the accomplishments of the past year, to take stock of what worked, and what didn’t. And of course there are the TOP TEN LISTS.

Spoiler alert, yes, this is a Top 10 list. If you’re a regular on the Citus Data blog, you know our Citus database engineers love PostgreSQL. And one of the open source responsibilities we take seriously is the importance of sharing learnings, how-to’s, and expertise. One way we share learnings is by giving lots of conference talks (seems like I have to update our Events page every week with new events.) And another way we share our learnings is with our blog.

So just in case you missed any of our best posts from last year, here is the TOP TEN list of the most popular Citus Data blogs published in 2018. Enjoy.


The Postgres 10 feature you didn’t know about: CREATE STATISTICS

BY SAMAY SHARMA | Postgres stores a lot of statistics about your data in order to effectively retrieve results when you query your database. In this post, Samay dives deep into some of the statistics PostgreSQL stores and how you can leverage CREATE STATISTICS to improve query performance when different columns are related.

In Postgres, the planner relies on collected statistics to estimate how many rows will be returned by a certain part of the plan, which then influences which type of scan or join algorithm will be used. Before Postgres 10, there wasn’t an easy way to tell the planner to collect statistics which capture the relationship between columns. But since the release of Postgres 10, there’s a feature which is built to solve exactly this problem. And this feature has gotten even better in PostgreSQL 11, too. All about CREATE STATISTICS and how it can help you.
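For a flavor of the syntax (the table and column names below are made up for illustration), extended statistics are declared once and then populated by the next ANALYZE:

-- tell the planner that city and zip are functionally related
CREATE STATISTICS city_zip_stats (dependencies) ON city, zip FROM addresses;
ANALYZE addresses;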


PostgreSQL rocks, except when it blocks: Understanding locks

BY MARCO SLOT | When Marco first published this post, it was a hit right away… and has continued to garner lots of reads, month after month. Why? Because while the open source Postgres database is really good at running multiple operations at the same time, there are some cases in which Postgres needs to block an operation using a lock—and so people want to understand locking behaviors in Postgres. Not all locks are bad of course; some last mere milliseconds and don’t actually hold things up in your database.

This post is all about demystifying locking behaviors. Along with advice on how to avoid common problems, based on all the learnings Marco has accumulated in his work with Citus database users over the years.


Database sharding explained in plain English

BY CRAIG KERSTIENS | Craig’s sharding in plain English post has become a go-to guide for understanding how sharding works for relational databases, and how it can be used to transform a single node database into a distributed one. Based on his experience leading the product team for the Citus Cloud database as a service (plus his Heroku Postgres experience before Citus), Craig found himself explaining how sharding works over and over again. So he finally put pen to paper, to explain sharding once and for all, in plain English. This post gives you an overview of the common misconceptions about sharding, partition keys, and key challenges.


When Postgres blocks: 7 tips for dealing with locks

BY MARCO SLOT | Marco highlights 7 do’s and don’ts that developers face when dealing with Postgres locks. Marco derived these tips from his work with developers building applications on top of Postgres and Citus, building both multi-tenant SaaS applications, and real-time analytics dashboards with time series data. Be sure to follow his advice not to freeze your database for hours with a VACUUM FULL.


Citus and pg_partman: Creating a scalable time series database on Postgres

BY CRAIG KERSTIENS | This post shows how you can leverage native time partitioning in Postgres via pg_partman, in combination with the Citus extension to Postgres. The result: a pretty awesome distributed relational time series database. pg_partman is best in class for improving time partitioning in Postgres—and by using pg_partman and Citus together, you can create tables that are distributed across nodes by ID and partitioned by time on disk.


Multi-tenant web apps with ASP.NET Core and Postgres

BY NATE BARBETTINI | This guest post gives you a step-by-step guide to building and architecting multi-tenant web applications for scale, using a combination of the open source, cross-platform ASP.NET Core framework, the awesome Postgres database, and the Citus extension to Postgres that transforms Postgres into a distributed database.

ASP.NET is used to build web applications and APIs, similar to other popular web frameworks like Express and Django. And it powers one of the biggest Q&A networks on the web: Stack Exchange. We wholeheartedly agree with Nate: it’s never too early to design for scale.


Why the RDBMS is the future of distributed databases

BY MARCO SLOT | Marco has a PhD in distributed systems, so he learned early on in life about the trade-offs between consistency and availability (aka the CAP theorem.) But in distributed systems, there are also trade-offs between latency, concurrency, scalability, durability, maintainability, functionality, operational simplicity, and other aspects of the system. And all these trade-offs have an impact on the features and user experience of applications.

In this post, Marco explains why the RDBMS is the future of distributed databases and how a distributed RDBMS like Citus can lower development costs for application developers. My favorite bit is the section on superpowers.


Three Approaches to PostgreSQL Replication and Backup

BY OZGUN ERDOGAN | Ozgun’s exploration of the three different approaches to PostgreSQL replication and backups got a lot of attention. This post explores streaming replication, volume replication via disk mirroring, WAL logs aka write-ahead logs, plus daily and incremental backups.

And while each PostgreSQL replication method has pros and cons, it turns out that one of the approaches is better in cloud-native environments, giving you both high availability (HA) and the ability to easily bring up or shoot down database nodes.


PostgreSQL 11 and Just In Time Compilation of Queries

BY DIMITRI FONTAINE | Right before PostgreSQL 11 was released, Dimitri blogged about this new component in the Postgres execution engine: a JIT expression compiler. And Dim shared compelling benchmark results that show up to 29.31% speed improvements in PostgreSQL 11 using the JIT expression compiler, executing TPC-H Q1 at scale factor 10 in 20.5s—vs. 29s when using PostgreSQL 10.

The new Postgres 11 JIT expression compiler is useful for long-running queries that are CPU-bound, such as queries with several complex expressions (think: aggregates.) Turns out that generating more efficient code that can run natively on the CPU (rather than passing queries through the Postgres interpreter) can sometimes be a good thing. :)


How Postgres is more than a relational database: Extensions

BY CRAIG KERSTIENS | The PostgreSQL Extension APIs have enabled so much innovation in the Postgres database world. Did you know PostGIS is an extension? (Of course you did.) HyperLogLog is an extension, too. And last year we created TopN, an extension to Postgres which Algolia uses. And then of course there’s Citus. :)

So it’s not a surprise when Craig observes that Postgres has shifted from being just a relational database to more of a “data platform”. The largest driver for this shift is Postgres extensions. This post gives you a tour of extensions and how they can transform Postgres into much more than a relational database.


Do you have a favorite? Let us know what type of Citus blog posts you’d like to see more of in 2019

Which one of these posts is your favorite? What type of Citus Data blog posts would you like to see more of in 2019? If you have ideas or feedback, we’d love to hear it: you can always tweet us @citusdata or you can send Craig Kerstiens or me a message on our Citus slack and let us know.

Turns out I don’t actually have a favorite among last year’s Top 10: each time I review one of our new blog posts to edit it, I find myself grateful for this talented team I get to work with, and delighted to be learning something new. :)

Achilleas Mantzios: One Security System for Application, Connection Pooling and PostgreSQL - The Case for LDAP


Traditionally, the typical application consists of just a few components: the application itself, a connection pool, and the database.

In this simple case, a basic setup would suffice:

  • the application uses a simple local authentication mechanism for its users
  • the application uses a simple connection pool
  • there is a single user defined for database access

However, as the organization evolves and gets larger, more components are added:

  • more tenant apps or instances of the app accessing the database
  • more services and systems accessing the database
  • central authentication/authorization (AA) for all (or most) services
  • separation of components for easier future scaling

In the above scheme, all concerns are separated into individual components, and each component serves a specialized purpose. However, the connection pool still uses a single dedicated database user, as in the simpler setup we saw above.

Besides the new components, new requirements also arrive:

  • finer-grained control of what users can do at the database level
  • auditing
  • better, more useful system logging

We can always implement all three with more application code or more layers in the application, but this is just cumbersome and hard to maintain.

In addition, PostgreSQL offers such a rich set of solutions in the aforementioned areas (security, Row Level Security, auditing, etc.) that it makes perfect sense to move all those services to the database layer. In order to get those services directly from the database, we must forget about a single shared user in the database and use real individual users instead.

This takes us to a scheme like the below:

In our use case we will describe a typical enterprise setup consisting of the above scheme where we use:

  • Wildfly app server (examples shown for version 10)
  • LDAP Authentication/Authorization Service
  • pgbouncer connection pooler
  • PostgreSQL 10

It seems like a typical setup, since jboss/wildfly has supported LDAP authentication and authorization for many years, and PostgreSQL has supported LDAP for many years as well.

However, pgbouncer only gained support for LDAP (and that via PAM) with version 1.8 in late 2017, which means that until then one could not use the hottest PostgreSQL connection pooler in such an enterprise setup (which did not sound promising from any angle we chose to look at it)!

In this blog, we will describe the setup needed in each layer.

Wildfly 10 Configuration

The data source configuration will have to look like this; I am showing only the most important parts:

<xa-datasource jndi-name="java:/pgsql" pool-name="pgsqlDS" enabled="true" mcp="org.jboss.jca.core.connectionmanager.pool.mcp.LeakDumperManagedConnectionPool">
	<xa-datasource-property name="DatabaseName">
		yourdbname
	</xa-datasource-property>
	<xa-datasource-property name="PortNumber">
		6432
	</xa-datasource-property>
	<xa-datasource-property name="ServerName">
		your.pgbouncer.server
	</xa-datasource-property>
	<xa-datasource-property name="PrepareThreshold">
		0
	</xa-datasource-property>
	<xa-datasource-class>org.postgresql.xa.PGXADataSource</xa-datasource-class>
	<driver>postgresql-9.4.1212.jar</driver>
	<new-connection-sql>
		SET application_name to 'myapp';
	</new-connection-sql>
	<xa-pool>
		<max-pool-size>400</max-pool-size>
		<allow-multiple-users>true</allow-multiple-users>
	</xa-pool>
	<security>
		<security-domain>postgresqluser</security-domain>
	</security>
</xa-datasource>

I have put in bold the important parameters and values. Remember to define the IP address (or hostname), the database name and the port according to your pgbouncer server’s setup.

Also, instead of the typical username/password, you’ll have to have a security domain defined, which must be specified in the data source section as shown above. Its definition will look like:

<security-domain name="postgresqluser">
	<authentication>
		<login-module code="org.picketbox.datasource.security.CallerIdentityLoginModule" flag="required">
			<module-option name="managedConnectionFactoryName" value="name=pgsql,jboss.jca:service=XATxCM"/>
		</login-module>
	</authentication>
</security-domain>

This way wildfly will delegate the security context to pgbouncer.

NOTE: in this blog we cover the basics, i.e. we make no use or mention of TLS, however you are strongly encouraged to use it in your installation.

The wildfly users must authenticate against your LDAP server as follows:

<login-module code="<your login module class>" flag="sufficient">
	<module-option name="java.naming.provider.url" value="ldap://your.ldap.server/"/>
	<module-option name="java.naming.security.authentication" value="simple"/>
	<module-option name="java.naming.factory.initial" value="com.sun.jndi.ldap.LdapCtxFactory"/>
	<module-option name="principalDNPrefix" value="uid="/>
	<module-option name="uidAttributeID" value="memberOf"/>
	<module-option name="roleNameAttributeID" value="cn"/>
	<module-option name="roleAttributeID" value="memberOf"/>
	<module-option name="principalDNSuffix"
	value=",cn=users,cn=accounts,dc=yourorgname,dc=com"/>
	<module-option name="userSrchBase" value="dc=yourorgname,dc=com"/>
	<module-option name="rolesCtxDN"
	value="cn=groups,cn=accounts,dc=yourorgname,dc=com"/>
	<module-option name="matchOnUserDN" value="true"/>
	<module-option name="unauthendicatedIdentity" value="foousr"/>
	<module-option name="com.sun.jndi.ldap.connect.timeout" value="5000"/>
</login-module>

The above configuration files apply to wildfly 10.0; in any case, you are advised to consult the official documentation for your environment.

PostgreSQL Configuration

In order to tell PostgreSQL to authenticate (NOTE: not authorise!) against your LDAP server you have to make the appropriate changes to postgresql.conf and pg_hba.conf. The entries of interest are the following:

In postgresql.conf:

listen_addresses = '*'

and in pg_hba.conf:

#TYPE  DATABASE    USER        CIDR-ADDRESS                  METHOD
host    all         all         ip.ofYourPgbouncer.server/32 ldap ldapserver=your.ldap.server ldapprefix="uid=" ldapsuffix=",cn=users,cn=accounts,dc=yourorgname,dc=com"

Make sure the LDAP settings defined here match exactly the ones you defined in your app server configuration. There are two modes of operation in which PostgreSQL can be instructed to contact the LDAP server:

  • simple bind
  • search and then bind

The simple bind mode requires only one connection to the LDAP server, so it is faster, but it requires a somewhat stricter LDAP directory organization than the second mode. The search and bind mode allows for greater flexibility (a search+bind example is sketched after the list below). However, for the average LDAP directory, the first mode (simple bind) will work just fine. We must underline certain points about PostgreSQL LDAP authentication:

  • This has to do with authentication only (checking passwords).
  • Role membership is still managed in PostgreSQL, as usual.
  • The users must be created in PostgreSQL (via CREATE user/role) as usual.
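For reference, a search+bind entry in pg_hba.conf looks roughly like the sketch below, reusing the hypothetical server and base DN from above; if ldapbinddn and ldapbindpasswd are omitted, the initial search is performed with an anonymous bind:

#TYPE  DATABASE    USER        CIDR-ADDRESS                  METHOD
host    all         all         ip.ofYourPgbouncer.server/32 ldap ldapserver=your.ldap.server ldapbasedn="cn=users,cn=accounts,dc=yourorgname,dc=com" ldapsearchattribute=uid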

There are some solutions to help with synchronization between LDAP and PostgreSQL users (e.g. ldap2pg) or you can simply write your own wrapper that will handle both LDAP and PostgreSQL for adding or deleting users.


PgBouncer Configuration

This is the hardest part of our setup, due to the fact that native LDAP support is still missing from pgbouncer, and the only option is to authenticate via PAM, which means that this depends on the correct local UNIX/Linux PAM setup for LDAP.

So the procedure is broken into two steps.

The first step is to configure and test that pgbouncer works with PAM, and the second step is to configure PAM to work with LDAP.

pgbouncer

pgbouncer must be compiled with PAM support. In order to do so you will have to:

  • install libpam0g-dev
  • ./configure --with-pam
  • recompile and install pgbouncer

Your pgbouncer.ini (or the name of your pgbouncer configuration file) must be configured for pam. Also, it must contain the correct parameters for your database and your application in accordance with the parameters described in the sections above. Things you will have to define or change:

yourdbname = host=your.pgsql.server dbname=yourdbname pool_size=5
listen_addr = *
auth_type = pam
# set pool_mode for max performance
pool_mode = transaction
# required for JDBC
ignore_startup_parameters = extra_float_digits

Of course, you will have to read the pgbouncer docs and tune your pgbouncer according to your needs. In order to test the above setup all you have to do is create a new local UNIX user and try to authenticate to pgbouncer:

# adduser testuser
<answer all questions, including password>

In order for pgbouncer to work with PAM when reading from the local passwd files, the pgbouncer executable must be owned by root and have the setuid bit set:

# chown root:staff ~pgbouncer/pgbouncer-1.9.0/pgbouncer     
# chmod +s ~pgbouncer/pgbouncer-1.9.0/pgbouncer
# ls -l ~pgbouncer/pgbouncer-1.9.0/pgbouncer           
-rwsrwsr-x 1 root staff 1672184 Dec 21 16:28 /home/pgbouncer/pgbouncer-1.9.0/pgbouncer

Note: The necessity for root ownership and setuid (which is true for every debian/ubuntu system I have tested) is nowhere documented, neither on the official pgbouncer docs nor anywhere on the net.

Then we log in (as the postgres superuser) to the postgresql host (or use psql -h your.pgsql.server) and create the new user:

CREATE USER testuser PASSWORD 'same as the UNIX passwd you gave above';

then from the pgbouncer host:

psql -h localhost -p 6432 yourdbname -U testuser

You should be able to get a prompt and see the tables as if you were connected directly to your database server. Remember to delete this user from the system and also drop it from the database when you are finished with all your tests.

PAM

In order for PAM to interface with the LDAP server, an additional package is needed: libpam-ldap. Its post-install script will run a text-mode dialog which you will have to answer with the correct parameters for your LDAP server. This package will make the necessary updates in the /etc/pam.d files and also create a file named /etc/pam_ldap.conf. In case something changes in the future you can always go back and edit this file. The most important lines in this file are:

base cn=users,cn=accounts,dc=yourorgname,dc=com
uri ldap://your.ldap.server/
ldap_version 3
pam_password crypt

The name/address of your LDAP server and the search base must be exactly the same as those specified in the PostgreSQL pg_hba.conf and the Wildfly standalone.xml conf files explained above. pam_login_attribute defaults to uid. You are encouraged to take a look at the /etc/pam.d/common-* files and see what changed after the installation of libpam-ldap. Following the docs, you could create a new file named /etc/pam.d/pgbouncer and define all PAM options there, but the default common-* files will suffice. Let’s take a look in /etc/pam.d/common-auth:

auth    [success=2 default=ignore]      pam_unix.so nullok_secure
auth    [success=1 default=ignore]      pam_ldap.so use_first_pass
auth    requisite                       pam_deny.so
auth    required                        pam_permit.so

Unix passwd will be checked first, and if this fails then LDAP will be checked, so bear in mind that you will have to erase any local passwords for those users who are defined both in the local linux/unix /etc/passwd and in LDAP. Now it is time to do the final test. Choose a user who is defined in your LDAP server and also created in PostgreSQL, and try to authenticate from the DB (via psql -h your.pgsql.server), then from pgbouncer (also via psql -h your.pgbouncer.server), and finally via your app. You just made having one single security system for app, connection pooler and PostgreSQL a reality!

Bruce Momjian: Three Factors of Authentication


Traditionally, passwords were used to prove identity electronically. As computing power has increased and attack vectors expanded, passwords are proving insufficient. Multi-factor authentication uses more than one authentication factor to strengthen authentication checking. The three factors are:

  1. What you know, e.g., password, PIN
  2. What you have, e.g., cell phone, cryptographic hardware
  3. What you are, e.g., fingerprint, iris pattern, voice

Postgres supports the first option, "What you know," natively using local and external passwords. It supports the second option, "What you have," using cert authentication. If the private key is secured with a password, that adds a second required factor for authentication. Cert only supports private keys stored in the file system, like a local file system or a removable USB memory stick.
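As a rough sketch of what the cert method looks like in practice (the addresses and file names here are illustrative; see the Postgres SSL documentation for the full details):

# postgresql.conf: enable SSL and point to the CA used to validate client certificates
ssl = on
ssl_ca_file = 'root.crt'

# pg_hba.conf: require an SSL client certificate whose CN matches the database user
hostssl  all  all  0.0.0.0/0  cert

# client side: libpq reads the certificate and key, which may live on removable media
$ psql "host=db.example.com dbname=postgres sslmode=verify-full sslcert=client.crt sslkey=client.key sslrootcert=root.crt"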

One enhanced authentication method allows access to private keys stored on PIV devices, like the YubiKey. There are two advantages of using a PIV device compared to cert:

  • Requires a PIN, like a private-key password, but locks the device after three incorrect PIN entries (File-system-stored private keys protected with passwords can be offline brute-force attacked.)
  • While the private key can be used to decrypt and sign data, it cannot be copied from the PIV device, unlike one stored in a file system

Continue Reading »

Peter Eisentraut: Maintaining feature branches and submitting patches with Git


I have developed a particular Git workflow for maintaining PostgreSQL feature branches and submitting patches to the pgsql-hackers mailing list and commit fests. Perhaps it’s also useful to others.

This workflow is useful for features that take a long time to develop, will be submitted for review several times, and will require a significant amount of changes over time. In simpler cases, it’s probably too much overhead.

You start as usual with a new feature branch off master

git checkout -b reindex-concurrently master

and code away. Make as many commits as you like for every change you make. Never rebase this branch. Push it somewhere else regularly for backup.

When it’s time to submit your feature for the first time, first merge in the current master branch, fix any conflicts, run all the tests:

git checkout master
git pull
make world
make check-world
git checkout reindex-concurrently
git merge master
# possibly conflict resolution
make world
make check-world

(The actual commands are something like make world -j4 -k and make check-world -Otarget -j4, but I’ll leave out those options in this post for simplicity.)

(Why run the build and tests on the master branch before merging? That ensures that if the build or tests fail later in your branch, it’s because of your code and not because a bad situation in master or a system problem. It happens.)

Now your code is good to go. But you can’t submit it like that; you need to squash it into a single patch. But you don’t want to mess up your development history by overwriting your feature branch. I create what I call a separate “submission” branch for this, like so:

git checkout -b _submission/reindex-concurrently-v1 master

The naming here is just so that this shows up sorted separately in my git branch list. Then

git merge --squash reindex-concurrently

This effectively makes a copy of your feature branch as a single staged commit but doesn’t commit it yet. Now commit it:

git commit

At the first go-round, you should now write a commit message for your feature patch. Then create a patch file:

git format-patch -v1 master --base master

The --base option is useful because it records in your patch what master was when you created the patch. If a reviewer encounters a conflict when applying the patch, they can then apply the patch on the recorded base commit instead. Obviously, you’ll eventually want to fix the conflict with a new patch version, but the reviewing can continue in the meantime.

Then you continue hacking on your feature branch and want to send the next version. First, update master again and check that it’s good:

git checkout master
git pull
make world
make check-world

Then merge it into your feature branch:

git checkout reindex-concurrently
git merge master
# possibly conflict resolution
make world
make check-world

Then make a new submission branch:

git checkout -b _submission/reindex-concurrently-v2 master

Again stage a squashed commit:

git merge --squash reindex-concurrently

Now when you commit you can copy the commit message from the previous patch version:

git commit -C _submission/reindex-concurrently-v1 -e --reset-author

The option -C takes the commit message from the given commit. The option -e allows you to edit the commit message, so you can enhance and refine it for each new version. --reset-author is necessary to update the commit’s author timestamp. Otherwise it keeps using the timestamp of the previous version’s commit.

And again create the patch file with a new version:

git format-patch -v2 master --base master

The advantage of this workflow is that on the one hand you can keep the feature branch evolving without messing with rebases, and on the other hand you can create squashed commits for submission while not having to retype or copy-and-paste commit messages, and you keep a record of what you submitted inside the git system.


Joshua Drake: CFP extended until Friday!

 Postgres Conference 2019


We’ve had a great response to our PostgresConf US 2019 call for proposals with over 170 potential presentations -- thank you to everyone who has submitted so far! As with what has become a tradition among Postgres Conferences, we are extending our deadline by one week to allow those final opportunities to trickle in!


We accept all topics that relate to People, Postgres, Data, including any Postgres-related topic, such as open source technologies (Linux, Python, Ruby, Golang, PostGIS).

Sessions related to Regulated Industries, including healthtech, fintech, govtech, etc., are especially in demand, particularly use cases and case studies.

Interested in attending this year’s conference?

We’ve expanded our offerings, with trainings and tutorials open to everyone who purchases a Platinum registration. There are no separate fees for Monday’s trainings (but seating will be first come, first served).

Don’t forget that Early Bird registration ends this Friday, January 18. Tickets are substantially discounted when purchased early.

Register for PostgresConf 2019

Interested in an AWESOME international Postgres Conference opportunity? Consider attending PgConf Russia





Craig Kerstiens: Contributing to Postgres


About once a month I get this question: “How do I contribute to Postgres?”. PostgreSQL is a great database with a solid code base, and for many of us, contributing back to open source is a worthwhile cause. The thing about contributing back to Postgres is you generally don’t just jump right in and commit code on day one. So figuring out where to start can be a bit overwhelming. If you’re considering getting more involved with Postgres, here are a few tips that you may find helpful.

Follow what’s happening

The number one way to familiarize yourself with the Postgres development and code community is to subscribe to the mailing lists. Even if you’re not considering contributing back, the mailing lists can be a great place to level up your knowledge and skills around Postgres. Fair warning: the mailing lists can be very active. But that’s ok, as you don’t necessarily need to read every email as it happens—daily digests work just fine. There is a long list of mailing lists you can subscribe to, but here are a few I think you should know about:

  • pgsql-general - This is the most active of mailing lists where you’ll find questions about working with Postgres and troubleshooting. It’s a great place to start to chime in and help others as you see questions.
  • pgsql-hackers - Where core development happens. A must read to follow along for a few months before you start contributing yourself
  • pgsql-announce - Major announcements about new releases and happenings with Postgres.
  • pgsql-advocacy - If you’re more interested in the evangelism side this one is worth a subscription.

Following along to these lists will definitely prepare you to contribute code in the future. And will give you the opportunity to chime in to the discussions.

Familiarize yourself with the process

As you read along with the mailing lists, the docs will become one of your best friends for understanding how certain things work. The Postgres docs are rich with detail about how things work, so when you have questions it’s best to check there before asking something that may have already been answered. Contributions can range from docs improvements to bug fixes to new features, and different areas may have slightly different processes. The PostgreSQL wiki has its own guide to help prepare you for contributing.

So you think you want to contribute?

You’ve read the mailing lists, you’ve chimed in to some discussions, you’re ready. But where to begin?

One of the best places to start is helping with code review of open patches during a commitfest. You see, development for Postgres happens in sprints (known as commitfests). Within the commitfest app you can browse the various commitfests, browse open patches, and review or comment on them. Often just as much work goes into reviewing patches as goes into writing them, so contributing on review and testing can be extremely helpful.

You’ve reviewed some patches, you’re ready to write your own

Before you just jump right in and write a patch, it can be good to test the waters. Consider starting back at step one with a discussion on the mailing list. Lay out the problem you see and say that you’re interested in helping to fix it. Sometimes someone else may already be working on it, or may see dangers in that area, or just maybe the mailing list is overwhelmingly on board and looking forward to your patch. What you picked up from following the mailing lists and reviewing patches should serve you well as you start to dig in. Then once it is written, submit it during an open commitfest.

Code isn’t the only way

Yes, Postgres is code and at heart it is a database. But it is more than that; it’s a community. Postgres doesn’t work on code alone, it works from people sharing their knowledge on it and helping others. If you don’t feel quite ready to jump in and contribute code, there are plenty of opportunities to give back, such as joining your area non-profit to support evangelism efforts: PostgreSQL.US in the US, and PostgreSQL EU in Europe. You can also start blogging about Postgres yourself (if you do, register your blog for Planet PostgreSQL) or consider speaking at your nearby user group.

Not everyone will be a PostgreSQL developer, but that doesn’t mean you can’t follow along and participate. If you just want to lurk that’s perfectly fine. Consider subscribing to Postgres Weekly or the Citus newsletter and follow on twitter @postgresql.

Bruce Momjian: Removable Certificate Authentication


I mentioned previously that it is possible to implement certificate authentication on removable media, e.g., a USB memory stick. This blog post shows how it is done. First, root and server certificates and key files must be created:

$ cd $PGDATA
 
# create root certificate and key file
$ openssl req -new -nodes -text -out root.csr -keyout root.key -subj "/CN=root.momjian.us"
$ chmod og-rwx root.key
$ openssl x509 -req -in root.csr -text -days 3650 -extfile /etc/ssl/openssl.cnf -extensions v3_ca -signkey root.key -out root.crt
 
# create server certificate and key file
$ openssl req -new -nodes -text -out server.csr -keyout server.key -subj "/CN=momjian.us"
$ chmod og-rwx server.key
$ openssl x509 -req -in server.csr -text -days 365 -CA root.crt -CAkey root.key -CAcreateserial -out server.crt

Continue Reading »

Michael Paquier: Postgres 12 highlight - SKIP_LOCKED for VACUUM and ANALYZE


The following commit has been merged into Postgres 12, adding a new option for VACUUM and ANALYZE:

commit: 803b1301e8c9aac478abeec62824a5d09664ffff
author: Michael Paquier <michael@paquier.xyz>
date: Thu, 4 Oct 2018 09:00:33 +0900
Add option SKIP_LOCKED to VACUUM and ANALYZE

When specified, this option allows VACUUM to skip the work on a relation
if there is a conflicting lock on it when trying to open it at the
beginning of its processing.

Similarly to autovacuum, this comes with a couple of limitations while
the relation is processed which can cause the process to still block:
- when opening the relation indexes.
- when acquiring row samples for table inheritance trees, partition trees
or certain types of foreign tables, and that a lock is taken on some
leaves of such trees.

Author: Nathan Bossart
Reviewed-by: Michael Paquier, Andres Freund, Masahiko Sawada
Discussion: https://postgr.es/m/9EF7EBE4-720D-4CF1-9D0E-4403D7E92990@amazon.com
Discussion: https://postgr.es/m/20171201160907.27110.74730@wrigleys.postgresql.org

Postgres 11 extended VACUUM so that multiple relations can be specified in a single command, processing each relation one at a time. However, if VACUUM gets stuck on a relation which is locked for one reason or another for a long time, it is up to the application layer which triggered VACUUM to notice that and unblock the situation. SKIP_LOCKED brings more control here by immediately skipping any relation that cannot be locked at the beginning of VACUUM or ANALYZE processing, meaning that the processing will finish in a timely manner at the cost of potentially doing nothing, which can also be dangerous if a table keeps accumulating bloat and never gets cleaned up. As mentioned in the commit message, there are some limitations similar to autovacuum:

  • Relation indexes may need to be locked, which would cause the processing to still block when working on them.
  • The list of relations part of a partition or inheritance tree to process is built at the beginning of VACUUM or ANALYZE. If the parent table is locked, then none of its children are processed. If one of the children is locked and the parent is listed in VACUUM, then all members of the tree are processed except the locked child. However, a limitation shows up in the middle of acquiring sample rows for trees, as ANALYZE will still block if a lock is held on a child while acquiring row samples for statistics on the parent.

This option is only supported with the parenthesized grammar of those commands, for example:

=# VACUUM (SKIP_LOCKED) tab1, tab2;
WARNING:  55P03: skipping vacuum of "tab1" --- lock not available
LOCATION:  expand_vacuum_rel, vacuum.c:654
VACUUM

And in this case the first table listed was locked, so it was skipped while the second was vacuumed.

On the way, note that more options have been added to vacuumdb thanks to this commit:

commit: 354e95d1f2122d20c1c5895eb3973cfb0e8d0cc2
author: Michael Paquier <michael@paquier.xyz>
date: Tue, 8 Jan 2019 10:52:29 +0900
Add --disable-page-skipping and --skip-locked to vacuumdb

DISABLE_PAGE_SKIPPING is available since v9.6, and SKIP_LOCKED since
v12.  They lacked equivalents for vacuumdb, so this closes the gap.

Author: Nathan Bossart
Reviewed-by: Michael Paquier, Masahiko Sawada
Discussion: https://postgr.es/m/FFE5373C-E26A-495B-B5C8-911EC4A41C5E@amazon.com

So, combined with --table, it is possible to get the same level of control as what VACUUM and ANALYZE provide, though DISABLE_PAGE_SKIPPING has been present since 9.6. This feature is also added into Postgres 12.
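For example (the table and database names here are hypothetical), the new vacuumdb options from the commit above can be combined like so:

$ vacuumdb --skip-locked --disable-page-skipping --table tab1 --table tab2 mydb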

Venkata Nagothi: An Overview of JSON Capabilities Within PostgreSQL


What is JSON?

JSON stands for “JavaScript Object Notation”, a data format popularly used by web applications. This means the data is transmitted between web applications and servers in that format. JSON was introduced as an alternative to the XML format. In the “good old days” the data used to get transmitted in XML format, which is a heavyweight data type compared to JSON. Below is an example of a JSON formatted string:

{ "ID":"001","name": "Ven", "Country": "Australia",  "city": "Sydney", "Job Title":"Database Consultant"}

A JSON string can contain another JSON object within itself, as shown below:

{ "ID":"001", "name": "Ven", "Job Title":"Database Consultant", "Location":{"Suburb":"Dee Why","city": "Sydney","State":"NSW","Country": "Australia"}}

Modern-day web and mobile applications mostly generate data in JSON format, also termed “JSON bytes”, which is picked up by the application servers and sent across to the database. The JSON bytes are in turn processed, broken down into separate column values, and inserted into an RDBMS table.
Example:

{ "ID":"001","name": "Ven", "Country": "Australia",  "city": "Sydney", "Job Title":"Database Consultant"}

The above JSON data is converted to SQL like the below:

Insert into test (id, name, country,city,job_title) values  (001,'Ven','Australia','Sydney','Database Consultant');

When it comes to storing and processing JSON data, there are various NoSQL databases supporting it, the most popular one being MongoDB. When it comes to RDBMS databases, until recent times JSON strings were treated as normal text and there were no data types which specifically recognize, store or process JSON formatted strings. PostgreSQL, the most popular open-source RDBMS, has come up with a JSON data type which has turned out to be highly beneficial for performance, functionality and scalability when handling JSON data.

PostgreSQL + JSON

The PostgreSQL database has become more and more popular ever since the JSON data type was introduced. In fact, PostgreSQL has been outperforming MongoDB when it comes to processing large amounts of JSON data. Applications can store JSON strings in the PostgreSQL database in the standard JSON format. Developers just need to tell the application to send the JSON strings to the database as a json data type and retrieve them back in JSON format. Storing a JSON string in a JSON data type has several advantages compared to storing the same in a TEXT data type. The JSON data type accepts only valid JSON formatted strings; if the string is not in correct JSON format, an error is generated, as illustrated below. The JSON data type also helps the application perform efficient, index-based searches, which we will see in detail shortly.
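For example, a malformed string is rejected outright when cast to json (the exact error text may vary between versions):

dbt3=# SELECT '{"ID":"001","name":}'::json;
ERROR:  invalid input syntax for type json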

The JSON data type was introduced in PostgreSQL 9.2, after which significant enhancements were made. The major addition came in PostgreSQL 9.4 with the JSONB data type. JSONB is an advanced version of the JSON data type which stores JSON data in a binary format. This is the major enhancement which made a big difference to the way JSON data is searched and processed in PostgreSQL. Let us have a detailed look at the advantages of the JSON data types.

JSON and JSONB Data Types

The JSON data type stores JSON formatted strings as plain text, which is not very powerful and does not support several of the JSON operators used for searches. It supports only traditional B-TREE indexing on extracted values and does not support the other index types that are imperative for faster and more efficient search operations across JSON data.

JSONB, the advanced version of JSON data type, is highly recommended for storing and processing JSON documents. It supports a wide range of json operators and has numerous advantages over JSON, like storing JSON formatted strings in binary format, and supporting JSON functions and indexing, to perform efficient searches.

Let us look at the differences.

  1. JSON is pretty much like a TEXT data type that stores only valid JSON documents, while JSONB stores the JSON documents in binary format.
  2. JSON stores the documents as-is, including white space, while JSONB trims white space and stores the data in a format conducive to faster and more efficient searches.
  3. JSON does not support FULL-TEXT-SEARCH indexing, while JSONB does.
  4. JSON does not support the full range of JSON functions and operators, while JSONB supports all of them.
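A quick way to see differences 1 and 2 is to cast the same string to both types (a minimal sketch):

-- json keeps the input text verbatim, including white space
SELECT '{"a":   1,   "b":   2}'::json;    -- returns {"a":   1,   "b":   2}

-- jsonb parses the input and stores a normalized binary representation
SELECT '{"a":   1,   "b":   2}'::jsonb;   -- returns {"a": 1, "b": 2}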

Example for #4 Listed Above

JSON

Below is a table with a JSON data type column:

dbt3=# \d product
                   Table "dbt3.product"
     Column     |  Type  | Collation | Nullable | Default
----------------+--------+-----------+----------+---------
 item_code      | bigint |           | not null |
 productdetails | json   |           |          |
Indexes:
    "product_pkey" PRIMARY KEY, btree (item_code)

The JSON data type does not support containment operators like “@>”, which are used for searching through JSON data in SQL:

dbt3=# select * from product where productdetails @> '{"l_shipmode":"AIR"}' and productdetails @> '{"l_quantity":"27"}';
ERROR:  operator does not exist: json @> unknown
LINE 1: select * from product where productdetails @> '{"l_shipmode"...
                                                   ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.
dbt3=#

JSONB

Below is a table with a JSONB data type column:

dbt3=# \d products
                  Table "dbt3.products"
    Column     |  Type  | Collation | Nullable | Default
---------------+--------+-----------+----------+---------
 item_code     | bigint |           | not null |
 order_details | jsonb  |           |          |
Indexes:
    "products_pkey" PRIMARY KEY, btree (item_code)

JSONB supports searching through JSON data using containment operators like “@>”:

dbt3=# select * from products where order_details @> '{"l_shipmode" : "AIR"}' limit 2;
 item_code |                                                                                        order_details
-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
         4 | {"l_partkey": 21315, "l_orderkey": 1, "l_quantity": 28, "l_shipdate": "1996-04-21", "l_shipmode": "AIR", "l_commitdate": "1996-03-30", "l_shipinstruct": "NONE", "l_extendedprice": 34616.7}
         8 | {"l_partkey": 42970, "l_orderkey": 3, "l_quantity": 45, "l_shipdate": "1994-02-02", "l_shipmode": "AIR", "l_commitdate": "1994-01-04", "l_shipinstruct": "NONE", "l_extendedprice": 86083.6}
(2 rows)

How to Query JSON Data

Let us take a look at some PostgreSQL JSON capabilities related to data operations. Below is how the JSON data looks in a table; the column “order_details” is of type JSONB.

dbt3=# select * from product_details ;
 item_code |                                                                                                 order_details
-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
         1 | {"l_partkey": 1551894, "l_orderkey": 1, "l_quantity": 17, "l_shipdate": "1996-03-13", "l_shipmode": "TRUCK", "l_commitdate": "1996-02-12", "l_shipinstruct": "DELIVER IN PERSON", "l_extendedprice": 33078.9}
         2 | {"l_partkey": 673091, "l_orderkey": 1, "l_quantity": 36, "l_shipdate": "1996-04-12", "l_shipmode": "MAIL", "l_commitdate": "1996-02-28", "l_shipinstruct": "TAKE BACK RETURN", "l_extendedprice": 38306.2}
         3 | {"l_partkey": 636998, "l_orderkey": 1, "l_quantity": 8, "l_shipdate": "1996-01-29", "l_shipmode": "REG AIR", "l_commitdate": "1996-03-05", "l_shipinstruct": "TAKE BACK RETURN", "l_extendedprice": 15479.7}
         4 | {"l_partkey": 21315, "l_orderkey": 1, "l_quantity": 28, "l_shipdate": "1996-04-21", "l_shipmode": "AIR", "l_commitdate": "1996-03-30", "l_shipinstruct": "NONE", "l_extendedprice": 34616.7}
         5 | {"l_partkey": 240267, "l_orderkey": 1, "l_quantity": 24, "l_shipdate": "1996-03-30", "l_shipmode": "FOB", "l_commitdate": "1996-03-14", "l_shipinstruct": "NONE", "l_extendedprice": 28974}
         6 | {"l_partkey": 156345, "l_orderkey": 1, "l_quantity": 32, "l_shipdate": "1996-01-30", "l_shipmode": "MAIL", "l_commitdate": "1996-02-07", "l_shipinstruct": "DELIVER IN PERSON", "l_extendedprice": 44842.9}
         7 | {"l_partkey": 1061698, "l_orderkey": 2, "l_quantity": 38, "l_shipdate": "1997-01-28", "l_shipmode": "RAIL", "l_commitdate": "1997-01-14", "l_shipinstruct": "TAKE BACK RETURN", "l_extendedprice": 63066.3}
         8 | {"l_partkey": 42970, "l_orderkey": 3, "l_quantity": 45, "l_shipdate": "1994-02-02", "l_shipmode": "AIR", "l_commitdate": "1994-01-04", "l_shipinstruct": "NONE", "l_extendedprice": 86083.6}
         9 | {"l_partkey": 190355, "l_orderkey": 3, "l_quantity": 49, "l_shipdate": "1993-11-09", "l_shipmode": "RAIL", "l_commitdate": "1993-12-20", "l_shipinstruct": "TAKE BACK RETURN", "l_extendedprice": 70822.1}
        10 | {"l_partkey": 1284483, "l_orderkey": 3, "l_quantity": 27, "l_shipdate": "1994-01-16", "l_shipmode": "SHIP", "l_commitdate": "1993-11-22", "l_shipinstruct": "DELIVER IN PERSON", "l_extendedprice": 39620.3}
(10 rows)

Select all the item codes including their shipment dates

dbt3=# select item_code, order_details->'l_shipdate' as shipment_date from product_details ;

 item_code | shipment_date
-----------+---------------
         1 | "1996-03-13"
         2 | "1996-04-12"
         3 | "1996-01-29"
         4 | "1996-04-21"
         5 | "1996-03-30"
         6 | "1996-01-30"
         7 | "1997-01-28"
         8 | "1994-02-02"
         9 | "1993-11-09"
        10 | "1994-01-16"
(10 rows)

Get the item_code, quantity, and price of all the orders that arrived by air:

dbt3=# select item_code, order_details->'l_quantity' as quantity, order_details->'l_extendedprice' as price, order_details->'l_shipmode' as ship_mode from product_details where order_details->>'l_shipmode'='AIR';

 item_code | quantity |  price  | ship_mode
-----------+----------+---------+-----------
         4 | 28       | 34616.7 | "AIR"
         8 | 45       | 86083.6 | "AIR"
(2 rows)

The JSON operators “->” and “->>” are used for selections and comparisons in SQL queries. The “->” operator returns the JSON object field as JSON (which is why the values above appear in quotes), while the “->>” operator returns the JSON object field as TEXT. The two SQLs above display JSON field values as-is; below is an example of fetching a JSON field in TEXT form.

dbt3=# select item_code, order_details->>'l_shipdate' as shipment_date from product_details ;
 item_code | shipment_date
-----------+---------------
         1 | 1996-03-13
         2 | 1996-04-12
         3 | 1996-01-29
         4 | 1996-04-21
         5 | 1996-03-30
         6 | 1996-01-30
         7 | 1997-01-28
         8 | 1994-02-02
         9 | 1993-11-09
        10 | 1994-01-16
(10 rows)

There is another operator, “#>”, which queries a value nested inside a JSON element, i.e. a JSON object that is itself part of the JSON string. Let us look at an example.
Below is the data in the table:

dbt3=# select * from test_json ;
  id   |                                                                                                details
-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 10000 | {"Job": "Database Consultant", "name": "Venkata", "Location": {"city": "Sydney", "State": "NSW", "Suburb": "Dee Why", "Country": "Australia"}}
 20000 | {"Job": "Database Consultant", "name": "Smith", "Location": {"city": "Sydney", "State": "NSW", "Suburb": "Manly", "Country": "Australia"}}
 30000 | {"Job": "Developer", "name": "John", "Location": {"city": "Sydney", "State": "NSW", "Suburb": "Brookvale", "Country": "Australia"}}
 50000 | {"cars": {"Ford": [{"doors": 4, "model": "Taurus"}, {"doors": 4, "model": "Escort"}], "Nissan": [{"doors": 4, "model": "Sentra"}, {"doors": 4, "model": "Maxima"}, {"doors": 2, "model": "Skyline"}]}}
 40000 | {"Job": "Architect", "name": "James", "Location": {"city": "Melbourne", "State": "NSW", "Suburb": "Trugnania", "Country": "Australia"}}

Say we want to see all the rows whose “State” is “NSW”, where “State” is a JSON object key nested inside the key “Location”. Below is how to query that:

dbt3=# select * from test_json where details #> '{Location,State}'='"NSW"';
  id   |                                                                    details
-------+------------------------------------------------------------------------------------------------------------------------------------------------
 10000 | {"Job": "Database Consultant", "name": "Venkata", "Location": {"city": "Sydney", "State": "NSW", "Suburb": "Dee Why", "Country": "Australia"}}
 20000 | {"Job": "Database Consultant", "name": "Smith", "Location": {"city": "Sydney", "State": "NSW", "Suburb": "Manly", "Country": "Australia"}}
 30000 | {"Job": "Developer", "name": "John", "Location": {"city": "Sydney", "State": "NSW", "Suburb": "Brookvale", "Country": "Australia"}}
 40000 | {"Job": "Architect", "name": "James", "Location": {"city": "Melbourne", "State": "NSW", "Suburb": "Trugnania", "Country": "Australia"}}
(4 rows)

Arithmetic operations and comparisons can be performed on JSON data. Type casting is needed because the value extracted from a JSON column is not numeric by default:

dbt3=# select item_code, order_details->'l_quantity' as quantity, order_details->'l_extendedprice' as price, order_details->'l_shipmode' as ship_mode from product_details where (order_details->'l_quantity')::int > 10;
 item_code | quantity |  price  | ship_mode
-----------+----------+---------+-----------
         1 | 17       | 33078.9 | "TRUCK"
         2 | 36       | 38306.2 | "MAIL"
         4 | 28       | 34616.7 | "AIR"
         5 | 24       | 28974   | "FOB"
         6 | 32       | 44842.9 | "MAIL"
         7 | 38       | 63066.3 | "RAIL"
         8 | 45       | 86083.6 | "AIR"
         9 | 49       | 70822.1 | "RAIL"
        10 | 27       | 39620.3 | "SHIP"
(9 rows)

Apart from all of the above, the following operations can also be performed on JSON data using SQL, including JOINs (see the example query after this list):

  1. Sorting the data using ORDER BY clause
  2. Aggregation using aggregate functions like SUM, AVG, MIN, MAX etc
  3. Group the data using GROUP BY clause
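As a minimal sketch, reusing the product_details table from above, sorting, grouping, and aggregation can be combined in a single query:

select   order_details->>'l_shipmode'                      as ship_mode,
         count(*)                                          as orders,
         sum((order_details->>'l_quantity')::int)          as total_quantity,
         avg((order_details->>'l_extendedprice')::numeric) as avg_price
from     product_details
group by order_details->>'l_shipmode'
order by total_quantity desc;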

How About Performance?

The data in JSON columns is textual in nature, and depending on the data size, performance problems can be expected. Searches through JSON data can take time and computing power, resulting in slow responses to the application(s). It is imperative for DBAs to ensure that SQLs hitting the JSON columns respond fast enough and render good performance. Since the data extraction is done via SQL, the option DBAs would look for is indexing, and yes, the JSON data types do support indexing options.

Let us take a look at the Indexing options JSON brings us.

Indexing JSONB

The JSONB data type supports FULL-TEXT-SEARCH indexing. This is the most important capability of JSONB that DBAs will look forward to when using JSONB data types. A normal index on a JSON object key may not help when JSON-specific operators are used in the search queries. Below is a TEXT SEARCH query which goes for a FULL-TABLE-SCAN:

dbt3=# explain select * from products where order_details @> '{"l_shipmode" : "AIR"}';
                             QUERY PLAN
--------------------------------------------------------------------
 Seq Scan on products  (cost=0.00..4205822.65 rows=59986 width=252)
   Filter: (order_details @> '{"l_shipmode": "AIR"}'::jsonb)
(2 rows)

JSONB supports a FULL-TEXT-SEARCH index type called GIN, which helps queries like the one above.
Now, let me create a GIN index and see if that helps:

dbt3=# create index od_gin_idx on products using gin(order_details jsonb_path_ops);
CREATE INDEX

As you can observe below, the query now picks up the GIN index:

dbt3=# explain select * from products where order_details @> '{"l_shipmode" : "AIR"}';
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Bitmap Heap Scan on products  (cost=576.89..215803.18 rows=59986 width=252)
   Recheck Cond: (order_details @> '{"l_shipmode": "AIR"}'::jsonb)
   ->  Bitmap Index Scan on od_gin_idx  (cost=0.00..561.90 rows=59986 width=0)
         Index Cond: (order_details @> '{"l_shipmode": "AIR"}'::jsonb)

And a B-TREE index instead of GIN would NOT help

dbt3=# create index idx on products((order_details->>'l_shipmode'));
CREATE INDEX

dbt3=# \d products
                  Table "dbt3.products"
    Column     |  Type  | Collation | Nullable | Default
---------------+--------+-----------+----------+---------
 item_code     | bigint |           | not null |
 order_details | jsonb  |           |          |
Indexes:
    "products_pkey" PRIMARY KEY, btree (item_code)
    "idx" btree ((order_details ->> 'l_shipmode'::text))

You can see below that the query still prefers a FULL-TABLE-SCAN:

dbt3=# explain select * from products where order_details @> '{"l_shipmode" : "AIR"}';
                             QUERY PLAN
--------------------------------------------------------------------
 Seq Scan on products  (cost=0.00..4205822.65 rows=59986 width=252)
   Filter: (order_details @> '{"l_shipmode": "AIR"}'::jsonb)

What is a GIN Index?

GIN stands for Generalised Inverted Index. The core capability of a GIN index is to speed up full-text searches. When searching based on specific keys or elements in a text or document, a GIN index is the way to go. A GIN index stores “key” (an element or a value) and “position list” pairs, where the position list holds the row IDs containing the key. This means that if a key occurs in multiple places in the document, the GIN index stores it only once along with its positions of occurrence, which keeps the GIN index compact in size and also speeds up searches in a great way. Compressed posting lists were among the GIN enhancements added in Postgres 9.4.
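For jsonb there are two GIN operator classes to choose from; below is a minimal sketch (the index names are illustrative):

-- the default jsonb_ops class supports containment (@>) as well as the
-- key-existence operators (?, ?|, ?&)
create index od_gin_default_idx on products using gin (order_details);

-- jsonb_path_ops builds a smaller, usually faster index, but supports only
-- containment (@>) searches
create index od_gin_path_idx on products using gin (order_details jsonb_path_ops);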

Challenges with GIN Index

Depending on the complexity of the data, maintaining GIN Indexes can be expensive. Creation of GIN Indexes consumes time and resources as the Index has to search through the whole document to find the Keys and their row IDs. It can be even more challenging if the GIN index is bloated. Also, the size of the GIN index can be very big based on the data size and complexity.
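One practical habit is to keep an eye on the index size relative to the table, for example (a minimal sketch reusing the objects created above):

select pg_size_pretty(pg_relation_size('od_gin_idx')) as gin_index_size,
       pg_size_pretty(pg_relation_size('products'))   as table_size;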

Indexing JSON

The JSON data type does not support text-search indexes like GIN:

dbt3=# create index pd_gin_idx on product using gin(productdetails jsonb_path_ops);
ERROR:  operator class "jsonb_path_ops" does not accept data type json

Normal Indexing like B-TREE is supported by both JSON and JSONB

Yes, normal indexes like the B-TREE index are supported by both the JSON and JSONB data types, though they are not conducive to text search operations. Each JSON object key can be indexed individually, which really helps ONLY when that same object key is used in the WHERE clause.
Let me create a B-TREE index on JSONB and see how it works:

dbt3=# create index idx on products((order_details->>'l_shipmode'));
CREATE INDEX

dbt3=# \d products
                  Table "dbt3.products"
    Column     |  Type  | Collation | Nullable | Default
---------------+--------+-----------+----------+---------
 item_code     | bigint |           | not null |
 order_details | jsonb  |           |          |
Indexes:
    "products_pkey" PRIMARY KEY, btree (item_code)
    "idx" btree ((order_details ->> 'l_shipmode'::text))

We have already learned above that a B-TREE index is NOT useful for speeding up SQLs that search through the JSON data using operators like “@>”; such indexes ONLY help speed up queries like the one below, which are typical RDBMS-style SQLs rather than search queries. Each JSON object key can be indexed individually, which speeds up queries when that indexed object key is used in the WHERE clause.
The example query below uses the “l_shipmode” object key in the WHERE clause, and since it is indexed, the query goes for an index scan. If you search on a different object key, the query will fall back to a FULL-TABLE-SCAN.

dbt3=# explain select * from products where order_details->>'l_shipmode'='AIR';
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Index Scan using idx on products  (cost=0.56..1158369.34 rows=299930 width=252)
   Index Cond: ((order_details ->> 'l_shipmode'::text) = 'AIR'::text)

The same works with the JSON data type as well:

dbt3=# create index idx on products((order_details->>'l_shipmode'));
CREATE INDEX

dbt3=# \d products
                  Table "dbt3.products"
    Column     |  Type  | Collation | Nullable | Default
---------------+--------+-----------+----------+---------
 item_code     | bigint |           | not null |
 order_details | json  |           |          |
Indexes:
    "products_pkey" PRIMARY KEY, btree (item_code)
    "idx" btree ((order_details ->> 'l_shipmode'::text))

As you can observe, the query is using the index:

dbt3=# explain select * from products where order_details->>'l_shipmode'='AIR';
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Index Scan using idx on products  (cost=0.56..1158369.34 rows=299930 width=252)
   Index Cond: ((order_details ->> 'l_shipmode'::text) = 'AIR'::text)

Conclusion

Here are some things to remember when using PostgreSQL JSON Data...

  • PostgreSQL is one of the best options to store and process JSON Data
  • With all the powerful features, PostgreSQL can be your document database
  • I have seen architectures where two or more data stores are chosen, with a mixture of PostgreSQL and NoSQL databases like MongoDB or Couchbase. A REST API helps the applications push data to the different data stores. With PostgreSQL supporting JSON, this complexity in the architecture can be avoided by choosing a single data store.
  • JSON data in PostgreSQL can be queried and indexed, which delivers good performance and scalability
  • The JSONB data type is the preferred option, as it is efficient in both storage and performance and fully supports searching and indexing
  • Use the JSON data type only if you want to store JSON strings as-is and are not performing complex searches over them
  • The biggest advantage of having JSON in PostgreSQL is that searches can be performed using SQL
  • JSON search performance in PostgreSQL has been reported to be on par with the best NoSQL databases, such as MongoDB

Luca Ferrari: PostgreSQL to Microsoft SQL Server Using TDS Foreign Data Wrapper


I needed to push data from a Microsoft SQL Server 2005 instance to our beloved database, so why not use an FDW for the purpose? It has not been as simple as with other FDWs, but it works!

PostgreSQL to Microsoft SQL Server Using TDS Foreign Data Wrapper

At work I needed to push data out from a Microsoft SQL Server 2005 to a PostgreSQL 11 instance. Foreign Data Wrappers was my first thought! Perl to the rescue was my second, but since I had some time, I decided to investigate the first way first.


The scenario was the following:

  • a CentOS 7 machine running PostgreSQL 11; it is not my preferred setup (I do prefer either FreeBSD or Ubuntu), but I have to deal with that;
  • Microsoft SQL Server 2005 running on a Windows Server, surely not something I like to work with, and to which I had to connect via remote desktop (argh!).

First Step: get TDS working

After a quick research on the web, I discovered that MSSQL talks the Tabular Data Stream protocol (TDS for short), so I didn’t need to install an ODBC stack on my Linux box. And luckily, there are binaries for CentOS:

$ sudo yum install freetds
$ sudo yum install freetds-devel freetds-doc

freetds comes along with a tsql terminal command that is meant to be a diagnostic tool, so it is nothing as complete as a psql terminal. You should really test your connectivity with tsql before proceeding further, since it can save you hours of debugging when things do not work.
Thanks to a pragmatic test with tsql I discovered that I needed to open port 1433 (default MSSQL port) on our...

Hans-Juergen Schoenig: pg_permission: Inspecting your PostgreSQL security system


Security is a super important topic. This is not only true in the PostgreSQL world; it holds true for pretty much any modern IT system. Databases, however, have special security requirements. More often than not confidential data is stored, and therefore it makes sense to ensure that the data is protected properly. Security first!

PostgreSQL: Listing all permissions

Gaining an overview of all permissions granted to users in PostgreSQL can be quite difficult. However, if you want to secure your system, gaining an overview is really everything: it can be quite easy to forget a permission here and there, and fixing things can then be a painful task. To make life easier, Cybertec has implemented pg_permission (https://github.com/cybertec-postgresql/pg_permission). There are a couple of things which can be achieved with pg_permission:

  • Gain a faster overview and list all permissions
  • Compare your “desired state” to what you got
  • Instantly fix errors

In short: pg_permission can do more than just listing what there is. However, let us get started with the simple case – listing all permissions. pg_permission provides a couple of views, which can be accessed directly once the extension has been deployed. Here is an example:

test=# \x
Expanded display is on.
test=# SELECT * 
       FROM 	all_permissions 
       WHERE 	role_name = 'workspace_owner';
-[ RECORD 1 ]-------------------------------------------------------
object_type | TABLE
role_name   | workspace_owner
schema_name | public
object_name | b
column_name | 
permission  | SELECT
granted     | t
-[ RECORD 2 ]-------------------------------------------------------
object_type | TABLE
role_name   | workspace_owner
schema_name | public
object_name | b
column_name | 
permission  | INSERT
granted     | t
-[ RECORD 3 ]-------------------------------------------------------
object_type | TABLE
role_name   | workspace_owner
schema_name | public
object_name | b
column_name | 
permission  | UPDATE
granted     | f

The easiest way is to use the “all_permissions” view to gain an overview of EVERYTHING. However, if you are only interested in functions, tables, columns, schemas and so on, there are more specific views you can use. “all_permissions” simply shows you all there is:

CREATE VIEW all_permissions AS
SELECT * FROM table_permissions
UNION ALL
SELECT * FROM view_permissions
UNION ALL
SELECT * FROM column_permissions
UNION ALL
SELECT * FROM sequence_permissions
UNION ALL
SELECT * FROM function_permissions
UNION ALL
SELECT * FROM schema_permissions
UNION ALL
SELECT * FROM database_permissions;

 

PostgreSQL: Detecting security issues

Securing your application is not too hard when your application is small. However, if your data model is changing, small errors and deficiencies might sneak in, which can cause severe security problems in the long run. pg_permissions has a solution to that problem: you can declare how the world is supposed to be. What does that mean? Here is an example: “All bookkeepers should be allowed to read data in the bookkeeping schema”, or “Everybody should have USAGE permissions on all schemas”. What you can do now is compare the world as it is with the way you want it to be. Here is how it works:

INSERT INTO public.permission_target
   (id, role_name, permissions, object_type, schema_name)
VALUES
   (3, 'appuser', '{USAGE}', 'SCHEMA', 'appschema');

The user also needs USAGE privileges on the appseq sequence in that schema:

INSERT INTO public.permission_target
   (id, role_name, permissions,
    object_type, schema_name, object_name)
VALUES
   (4, 'appuser', '{USAGE}', 'SEQUENCE', 'appschema', 'appseq');

SELECT * FROM public.permission_diffs();
 missing | role_name | object_type | schema_name | object_name | column_name | permission
---------+-----------+-------------+-------------+-------------+-------------+------------
       f |      hans |        VIEW |   appschema |     appview |             |     SELECT
       t |   appuser |       TABLE |   appschema |    apptable |             |     DELETE
(2 rows)

You will instantly get an overview of the differences between your desired state and your current state. By checking the differences directly during your deployment process, the extension allows you to react and fix problems quickly.

Changing permissions as fast as possible

Once you have figured out which permissions exist, which ones might be missing, and which ones are wrong, you might want to fix things. Basically there are two choices: you can fix stuff by hand and assign permissions one by one, which can be quite a pain and result in a lot of work. So why not just update the “all_permissions” view directly? pg_permissions allows you to do exactly that: you can simply update the views and pg_permission will execute the desired changes for you (firing GRANT and REVOKE statements behind the scenes). This way you can change hundreds or even thousands of permissions using a simple UPDATE statement (see the sketch below). Securing your database has never been easier.
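As a minimal sketch, reusing the workspace_owner example from above, the UPDATE permission that showed up as “granted | f” could be granted like this:

-- pg_permission fires the corresponding GRANT statement behind the scenes
UPDATE all_permissions
   SET granted = true
 WHERE role_name   = 'workspace_owner'
   AND object_type = 'TABLE'
   AND schema_name = 'public'
   AND object_name = 'b'
   AND permission  = 'UPDATE';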

Many people are struggling with GRANT and REVOKE statements. Therefore being able to use UPDATE might make life easier for many PostgreSQL users out there.

Making pg_permission even better

We want to make pg_permission even better. So if there are any cool ideas out there, don’t hesitate to contact us anytime. We are eagerly looking for new ideas and even better concepts.

The post pg_permission: Inspecting your PostgreSQL security system appeared first on Cybertec.


Bruce Momjian: Insufficient Passwords


As I already mentioned, passwords were traditionally used to prove identity electronically, but they are showing their weakness as computing power has increased and attack vectors have expanded. Basically, user passwords have several restrictions:

  • must be simple enough to remember
  • must be short enough to type repeatedly
  • must be complex enough to not be easily guessed
  • must be long enough to not be easily cracked (discovered by repeated password attempts) or the number of password attempts must be limited

As you can see, the simple/short and complex/long requirements are at odds, so there is always a tension between them. Users often choose simple or short passwords, and administrators often add password length and complexity requirements to counteract that, though there is a limit to the length and complexity that users will accept. Administrators can also add delays or a lockout after unsuccessful authorization attempts to reduce the cracking risk. Logging of authorization failures can sometimes help too.

While Postgres records failed login attempts in the server logs, it doesn't provide any of the other administrative tools for password control. Administrators are expected to use an external authentication service like LDAP or PAM, which have password management features.

Continue Reading »

Jonathan Katz: Scheduling Backups En Masse with the Postgres Operator


An important part of running a production PostgreSQL database system (and for that matter, any database software) is to ensure you are prepared for disaster. There are many ways to go about preparing your system for disaster, but one of the simplest and most effective ways to do this is by taking periodic backups of your database clusters.

How does one typically go about setting up a periodic backup? If you’re running PostgreSQL on a Linux-based system, the solution is often to use cron, setting up a crontab entry similar to this in your superuser account:

# take a daily base backup at 1am to a mount point on an external disk
# using pg_basebackup
0 1 * * * /usr/bin/env pg_basebackup -D /your/external/mount/

However, if you’re managing tens, if not hundreds or thousands, of PostgreSQL databases, this very quickly becomes an onerous task and you will need some automation to help you scale your disaster recovery safely and efficiently.

Automating Periodic Backups

The Crunchy PostgreSQL Operator, an application for managing PostgreSQL databases in a Kubernetes-based environment, is designed to manage thousands of PostgreSQL databases from a single interface, which helps with challenges like the one above. One of the key features of the PostgreSQL Operator is that it uses Kubernetes Labels to apply commands across many PostgreSQL databases. Later in this article, we will see how we can take advantage of labels in order to set backup policies across many clusters.

Richard Yen: The Curious Case of Split WAL Files


Introduction

I’ve come across this scenario maybe twice in the past eight years of using Streaming Replication in Postgres: replication on a standby refuses to proceed because a WAL file has already been removed from the primary server, so it executes Plan B by attempting to fetch the relevant WAL file using restore_command (assuming that archiving is set up and working properly), but upon replay of that fetched file, we hear another croak: “No such file or directory.” Huh? The file is there? ls shows that it is in the archive. On and on it goes, never actually progressing in the replication effort. What’s up with that? Here’s a snippet from the log:

2017-07-19 16:35:19 AEST [111282]: [96-1] user=,db=,client=  (0:00000)LOG:  restored log file "0000000A000000510000004D" from archive
scp: /archive/xlog//0000000A000000510000004E: No such file or directory
2017-07-19 16:35:20 AEST [114528]: [1-1] user=,db=,client=  (0:00000)LOG:  started streaming WAL from primary at 51/4D000000 on timeline 10
2017-07-19 16:35:20 AEST [114528]: [2-1] user=,db=,client=  (0:XX000)FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000A000000510000004D has already been removed

scp: /archive/xlog//0000000B.history: No such file or directory
scp: /archive/xlog//0000000A000000510000004E: No such file or directory
2017-07-19 16:35:20 AEST [114540]: [1-1] user=,db=,client=  (0:00000)LOG:  started streaming WAL from primary at 51/4D000000 on timeline 10
2017-07-19 16:35:20 AEST [114540]: [2-1] user=,db=,client=  (0:XX000)FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000A000000510000004D has already been removed

scp: /archive/xlog//0000000B.history: No such file or directory
scp: /archive/xlog//0000000A000000510000004E: No such file or directory
2017-07-19 16:35:25 AEST [114550]: [1-1] user=,db=,client=  (0:00000)LOG:  started streaming WAL from primary at 51/4D000000 on timeline 10
2017-07-19 16:35:25 AEST [114550]: [2-1] user=,db=,client=  (0:XX000)FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000A000000510000004D has already been removed

What’s going on?

I haven’t personally been able to reproduce this exact scenario, but we’ve discovered that this happens when a WAL entry is split across two WAL files. Because some WAL entries will span two files, Postgres Archive Replay doesn’t internally know that it needs both files to successfully replay the event and continue with streaming replication. In the example above, an additional detail would be to look at the pg_controldata output, which in this case looks something like this: Minimum recovery ending location: 51/4DFFED10. So when starting up the standby server (after a maintenance shutdown or other similar scenario), it clearly needs file 0000000A000000510000004D to proceed, so it attempts to fetch its contents from the archive and replay it. It happily restores all the relevant WAL files until it reaches the end of the 51/4D file, at which point it can no longer find more WAL files to replay. The toggling mechanism kicks in, and it starts up the walsender and walreceiver processes to perform streaming replication.

When the walreceiver starts up, it inspects the landscape and sees that it needs to start at 51/4DFFED10 for streaming replay, so it asks the walsender to fetch the contents of 0000000A000000510000004D and send them. However, it’s been a long time (maybe lots of traffic, or maybe a wal_keep_segments misconfiguration) and that 0000000A000000510000004D file is gone. Neither the walsender nor the walreceiver knows that the record at LSN 51/4DFFED10 doesn’t actually live in 0000000A000000510000004D but rather in 0000000A000000510000004E, and since 0000000A000000510000004D is already gone, it can’t be scanned to find out that 0000000A000000510000004E is needed.

A possible solution

In one of the cases I worked on, the newer file (0000000A000000510000004E in the above example) had not been filled up yet. It turned out that it was a low-traffic development environment, and the customer had simply needed to issue a pg_switch_xlog() against the primary server.

Of course, this isn’t a very reliable solution, since it requires human intervention. In the end, the more reliable solution was to use a replication slot, so that Postgres always holds on to the necessary WAL files and doesn’t move/delete them prematurely. While streaming replication slots have their pitfalls, when used properly they will ensure reliable replication with minimal configuration tweaks.
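As a minimal sketch (using the pre-10 function names that match this example; pg_switch_xlog() became pg_switch_wal() in Postgres 10, and the slot name below is illustrative):

-- Plan A, for low-traffic systems: force the primary to close and archive the
-- partially filled WAL segment so the standby can fetch its successor
SELECT pg_switch_xlog();

-- Plan B, the reliable fix: create a physical replication slot on the primary
-- and point the standby's recovery.conf primary_slot_name at it, so the primary
-- retains WAL until the standby has consumed it
SELECT pg_create_physical_replication_slot('standby1_slot');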

Special thanks to Andres Freund and Kuntal Ghosh for helping with the analysis and determining the solution.

Bruce Momjian: Synchronizing Authentication


I have already talked about external password security. What I would like to talk about now is keeping an external-password data store synchronized with Postgres.

Synchronizing the password itself is not the problem (the password is only stored in the external password store), but what about the existence of the user? If you create a user in LDAP or PAM, you would like that user to also be created in Postgres. Another synchronization problem is role membership: if you add or remove someone from a role in LDAP, it would be nice if the user's Postgres role membership were also updated.

ldap2pg can do this in batch mode. It will compare LDAP and Postgres and modify Postgres users and role memberships to match LDAP. This email thread talks about a custom solution that instantly creates users in Postgres when they are created in LDAP, rather than waiting for a periodic run of ldap2pg.

Robert Haas: Who Contributed to PostgreSQL Development in 2018?

This is my third annual post on who contributes to PostgreSQL development.  I have been asked a few times to include information on who employs these contributors, but I have chosen not to do that, partly but not only because I couldn't really vouch for the accuracy of any such information, nor would I be able to make it complete.  The employers of several people who contributed prominently in 2018 are unknown to me.
Read more »