Channel: Planet PostgreSQL

Andrew Dunstan: pgbouncer enhancements

A couple of customers have recently asked for enhancements of pgbouncer, and I have provided them.

One that's been working for a while now puts the address and port of the actual client (i.e. the program that connects to the proxy) into the session's application_name setting. That means that if you want to see where the client is that's running some query that's gone rogue, it's no longer hidden from you by the fact that all connections appear to be coming from the pgbouncer host. You can see it appearing in places like pg_stat_activity.
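For example, once the proxy has set it, a quick look at pg_stat_activity shows where each connection really comes from. This is just a sketch; there is nothing pgbouncer-specific about the query itself:

-- See the real client address/port carried over in application_name
SELECT pid, application_name, state, query
  FROM pg_stat_activity
 WHERE state <> 'idle';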

It only works when a client connects, so if the client itself sets application_name then the setting gets overridden. But few clients do this, and the original requester has found it useful. I've submitted this to the upstream repo, as can be seen at https://github.com/markokr/pgbouncer-dev/pull/23.

The other enhancement is the ability to include files in the config file. This actually involves a modification to the library pgbouncer uses as a git submodule, libusual. With this enhancement, a line that has "%include filename" causes the contents of that file to be included in place of the directive. Includes can be nested up to 10 deep. The pull request for this is at https://github.com/markokr/libusual/pull/7. This one too seems to be working happily at the client's site.

There is one more enhancement on the horizon, which involves adding in host based authentication control similar to that used by Postgres. That's a rather larger bit of work, but I hope to get to it in the next month or two.

Hubert 'depesz' Lubaczewski: Waiting for 9.4 – Add support for wrapping to psql’s “extended” mode.

On 28th of April, Greg Stark committed patch: Add support for wrapping to psql's "extended" mode. This makes it very   feasible to display tables that have both many columns and some large data in some columns (such as pg_stats).   Emre Hasegeli with review and rewriting from Sergey Muraviov and reviewed by Greg Stark […]

Tim van der Linden: PostgreSQL: A full text search engine - Part 2


Welcome to the second installment of our look into full text search within PostgreSQL.

If this is the first time you have heard about full text search, I highly encourage you to go and read the first chapter in this series before continuing. This chapter builds on what we saw previously.

A look back

In short, the previous chapter introduced the general concept of full text search, regardless of the software being used. It looked at how the idea of full text search was brought to computer software by breaking it up into roughly three steps: case removal, stop word removal, normalizing with synonyms and stemming.

Next we delved into PostgreSQL's implementation and introduced the tsvector and the tsquery as two new data types together with a handful of new functions such as to_tsvector(), to_tsquery() and plainto_tsquery(), which all extend PostgreSQL to support full text search.

We saw how we could feed PostgreSQL a string of text which would then get parsed into tokens and processed even further into lexemes which in turn got stored into a tsvector. We then queried that tsvector using the tsquery data type and the @@ matching operator.

In this chapter, I want to flesh out an important topic we touched on previously: PostgreSQL's full text search configurations.

Precaution

Let me be very clear: in most cases the configurations shipped with PostgreSQL will suffice and you do not need to touch them at all, in which case this chapter could be considered a waste of time.

However, I highly encourage you to read through this chapter and, as always, actually run the queries with me. You need to know how the tools you use work under the hood.

To be even more bold, someday you might even need to get your hands dirty and actually build your own configuration. Why? Because a customer wanting full text search for their application might have specific requirements, or even deliver you specific dictionaries to use in the parsing stage. Such use cases may arise in very specific fields where a lot of official, technical lingo is used that is not covered in a general dictionary.

So, put on your favorite pants (or none if you like that better), turn down the lights, pull the computer close to you, open up a terminal window, put on some eerie music and let us get started.

Configuring PostgreSQL full text search

In the last chapter we saw that PostgreSQL uses a couple of tools like a stop word list and dictionaries to perform its parsing. We also saw that we did not need to tell PostgreSQL about which of these tools to use. It turned out that full text search comes with a set of default configurations for several languages. We also found out that, if no configuration was given, the database assumes that the document or string to be parsed is English and uses a configuration called 'english'.

Beware of localized packages of PostgreSQL though. As I noted in the previous chapter, there is a small possibility that the default configuration in your PostgreSQL installation is not set to 'english'. If this is the case with your setup, be sure to pass the 'english' configuration explicitly in the queries that follow, or change your default to 'english'. We will see how to do that in a minute.

Taking the small database we created last time, the syntax to feed a configuration set to PostgreSQL during parsing was the following:

SELECT to_tsquery('english', 'elephants & blue');

The string 'english' represents the name of the configuration which we would like to use. As you know by now, this string can be omitted, which will make the database use the default configuration. PostgreSQL knows this default because it is set in the general postgresql.conf configuration file. In that file you will find a variable called default_text_search_config which, in most cases, is set to pg_catalog.english. If you wish to have your own, custom configuration be the default, that is the place to set it.
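To check which configuration your session is currently defaulting to, or to override it for the current session only, a quick sketch:

-- Show the current default
SHOW default_text_search_config;

-- Override for this session only; the value in postgresql.conf stays untouched
SET default_text_search_config = 'pg_catalog.english';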

Before hacking away at your own configuration, it may be of interest to see what PostgreSQL has to offer. To see which shipped configuration files are available to you, use the describe command (\d) together with the full text flag (F):

\dF

This will describe the objects in the database that represent full text configurations. You see that by default you have quite a lot of language support. To see a different configuration in action, let us do a quick, fun test.

First, take the Dutch string "Een blauwe olifant springt al dartelend over de kreupele dolfijn.", which is a rough translation of the "The blue elephant jumps over the crippled dolphin." example from the first chapter. If we feed this to PostgreSQL using the default (english) configuration:

SELECT to_tsvector('english', 'Een blauwe olifant springt al dartelend over de kreupele dolfijn');

We would get back:

'al':5 'blauw':2 'dartelend':6 'de':8 'dolfijn':10 'een':1 'kreupel':9 'olif':3 'springt':4

It attempted to guess some words as you can see from the lexeme 'olif', but, to a Dutch reader, this is not stemmed correctly. Neither are the stop words removed: 'de' and 'een' are articles which, in Dutch, are considered of no value in a text search context. So let us try this again with the built-in Dutch configuration:

SELECT to_tsvector('dutch', 'Een blauwe olifant springt al dartelend over de kreupele dolfijn');

And we get:

'blauw':2 'dartel':6 'dolfijn':10 'kreupel':9 'olifant':3 'springt':4

Aha! That is much shorter than the previous result, and it is also more correct. As you can see, the words 'de' and 'een' are now removed and the stemming is done correctly on 'dartel', 'olifant' and 'kreupel'. The target of this series, however, is not to teach you the Dutch language (for it will make you weep...), but you see the effect a different configuration set can have.

But what is such a configuration set made of? To answer that, we can simply use the same describe, but ask for more detailed information with the + flag:

\dF+

This will return a list of all the configurations and their details, so let us filter that and look at only the english version:

\dF+ english

The following result will be returned:

 asciihword      | english_stem
 asciiword       | english_stem
 email           | simple
 file            | simple
 float           | simple
 host            | simple
 hword           | english_stem
 hword_asciipart | english_stem
 hword_numpart   | simple
 hword_part      | english_stem
 int             | simple
 numhword        | simple
 numword         | simple
 sfloat          | simple
 uint            | simple
 url             | simple
 url_path        | simple
 version         | simple
 word            | english_stem

All of these are token categories that target the different groups of words that the PostgreSQL full text parser recognizes. For each category there are one or more dictionaries defined which will receive the token and try to return a lexeme. We also call this overview a configuration map, for it maps a category to one or more dictionaries.

If the parser encounters a URL, for example, it will categorize it as a url or url_path token and as a result, PostgreSQL will consult the dictionaries mapped to this category to try and create a single lexeme containing a URL pointing to the same path. Example:

  • example.com
  • example.com/index.html
  • example.com/foo/../index.html

The URLs all result in the same document being served, so it makes sense to only save one variant as a lexeme in the resulting vector. The same kind of normalization is done for file paths, version numbers, host names, units of measure, and so on. A lot more than normal English words.

There are 23 categories in total that the parser can recognize. Ones not included here are, for example, tag for XML tags and blank for whitespace or punctuation.

To see a description of the different token categories supported, use the 'p' flag together with '+' for more information:

\dFp+

When parsing, the chain of command goes as follows:

  • A string is fed to PostgreSQL's full text parser
  • The parser crawls over the string and chops it into tokens of a certain type
  • For each token category a list of dictionaries (or a single dictionary) is consulted
  • If a dictionary list is used, the dictionaries are (generally) ordered from most precise (narrow) to most generic (wide)
  • As soon as a dictionary returns a lexeme (single or in the form of an array), the flow for that token stops
  • If no lexeme is proposed (a dictionary returns NULL), the token is handed to the next dictionary in line; if a stop word list matches (the dictionary returns an empty array), the token is discarded (see the sketch below)
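These outcomes can be observed directly with ts_lexize(), a debugging function we will meet again at the end of this chapter. A minimal sketch against the shipped dictionaries (the expected outputs in the comments are what the stock 'english' setup should produce):

-- Stop word: english_stem returns an empty array, so the token is discarded
SELECT ts_lexize('english_stem', 'the');        -- {}

-- Regular word: a lexeme is returned and the chain stops here
SELECT ts_lexize('english_stem', 'elephants');  -- {eleph}

-- The simple dictionary never returns NULL for a non stop word (by default),
-- it just hands back the lowercased token
SELECT ts_lexize('simple', 'Elephants');        -- {elephants}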

Dictionary templates and dictionaries

In the list of token categories that are supported by the built-in 'english' configuration, you will find that only two dictionaries are used: simple and english_stem, which in turn come from the simple and snowball dictionary templates respectively.

So, what exactly is the difference between a dictionary template and a dictionary?

A dictionary template is the skeleton (hence template) of a dictionary. It defines the actual C functions that will do the heavy lifting. A dictionary is an instantiation of that template - providing it with data to work with.

Let me try to clear any confusion on this.

Take, for example, the simple dictionary template. It does two things: it first checks a token against a stop word list. If it finds a match it returns an empty array, which will result in the token being discarded. If no match is found in the stop word list, the process will return the same token, but with casing removed.

All the checking and case removing is done by functions, under the hood. The stop word file, however, is something that the dictionary (the instantiation) provides. The instantiation of the simple dictionary template, thus the dictionary itself, would be defined as follows:

CREATE TEXT SEARCH DICTIONARY simple (template = simple, stopwords = english);

No need to run this SQL, for PostgreSQL already ships with the simple dictionary; I just wish to show you how you could create it.

First, you will see that we have to define the template, thus telling PostgreSQL which set of functions to use. Next we feed it the data it is expecting; in the case of simple it only expects a stop word list.

The reason for this separation is safety. Only a database user with superuser privileges can write the actual template, because this template will contain functions that, if written incorrectly, could slow down or crash the database. You need someone who knows what they are doing and not your local script kiddy who has normal user access to your part of the database.

Notice that we only give the stopwords attribute the word english instead of a full file path. This is because PostgreSQL has set a few standards in place for all dictionary types we will see in this chapter.

First, in the case of a stop word list, the file must have the .stop extension.

Next, you can provide a full path to the file, anywhere on your system. However, if you do not provide a full path, PostgreSQL will search for it inside a directory called tsearch_data within PostgreSQL's portion of your system's user shared directory.

On a Debian system (using PostgreSQL 9.3) the path to this directory reads: "/usr/share/postgresql/9.3/tsearch_data".

A dictionary like simple is most of the time put at the beginning of a dictionary list to remove all the stop words before other dictionaries are consulted. However, in all the cases where we see simple in the dictionaries column of the token type list above, only this dictionary is used, meaning that only stop words are removed and everything else is merely stripped of casing.

Creating the "simple" dictionary

Say that we wanted to setup our own simple dictionary based on the simple dictionary template, but feed it our own list of stop words. Before setting up this new dictionary, we would first have to write a stop word file.

Luckily for us, this is trivial. A stop word file is nothing more than a plain text file with one word on each line. Empty lines and trailing whitespace are ignored. We would then have to save this file with the .stop extension. Let us try just that.

Open up your editor and punch in the words "dolphin" and "the", both on their own line. Write the file out as "shisaa_stop.stop", preferably in the tsearch_data directory mentioned above.
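The resulting shisaa_stop.stop file thus simply reads:

dolphin
the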

Next we need to setup our dictionary. Connect to the "phrases" database from chapter one and run the following SQL:

CREATE TEXT SEARCH DICTIONARY shisaa_simple (template = simple, stopwords = shisaa_stop);

Setting up a configuration

Now, the dictionary by itself is not very helpful. As we have seen before, we need to map it to token categories before we can actually use it for parsing. This means that we need to make our own configuration.

Let us setup an empty configuration (not based on an existing one like 'english'):

CREATE TEXT SEARCH CONFIGURATION shisaa (parser = 'default');

This statement will create a new configuration for us which is completely empty: it has no mappings. The argument we have to give here can be either parser or copy. With parser you define which parser to use and it will create an empty configuration. PostgreSQL has only one parser by default, which is named...default. If you choose copy then you will have to provide an existing configuration name (like english) from which you would like to make a copy.

To verify that the configuration is empty, run our describe on it:

\dF+ shisaa

And marvel at its emptiness.

Now, let us add the shisaa_simple dictionary we created before:

ALTER TEXT SEARCH CONFIGURATION shisaa
  ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
  WITH shisaa_simple;

As you will see throughout this (and the next) chapter, full text extends not only the data types and functions we have available, but also extends PostgreSQL's SQL syntax with a handful of new statements. I need to note that all of these statements are not SQL standard (for SQL has no full text standard) and thus cannot be easily ported to a different database. But then again...what is this folly...who would even need a different database!

The new statements introduced here (and in the previous SQL blocks) are:

  • CREATE TEXT SEARCH DICTIONARY
  • CREATE TEXT SEARCH CONFIGURATION
  • ALTER TEXT SEARCH CONFIGURATION

Just remember that these are not part of the SQL standard (something which PostgreSQL holds very dear, in high contrast with many other databases).

Did it work? Well, describe it:

\dF+ shisaa

And get back:

 asciihword      | shisaa_simple
 asciiword       | shisaa_simple
 hword           | shisaa_simple
 hword_asciipart | shisaa_simple
 hword_part      | shisaa_simple
 word            | shisaa_simple

Perfect!

Here we mapped our fresh dictionary to the token groups "asciihword", "asciiword", "hword", "hword_asciipart", "hword_part", "word", because these will target most of a normal, English sentence.

It is time to try out this new search configuration! Punch in the same on-the-fly SQL as we had in the previous chapter, but this time with our own configuration:

SELECT to_tsvector('shisaa', 'The big blue elephant jumped over the crippled blue dolphin.');

And we get back:

'big':2 'blue':3,9 'crippled':8 'elephant':4 'jumped':5 'over':6

Ha! All squeaky flippers unite! The word dolphin is removed, because we defined it to be a stop word. A world as it should be.

We now have a basic full text configuration with a simple dictionary. To have a more real-world full text search we will need more than just this dictionary, though; we will at least need to take care of stemming.

Extending the configuration: stemming with the Snowball

Stemming, the process of reducing words to their basic form, is done by a special, dedicated kind of dictionary, the Snowball dictionary.

What?

Snowball is a well-proven string processing language designed specifically for stemming, and it supports a wide range of languages. It originated from the Porter stemming algorithm and uses a natural syntax to define stemming rules.

And luckily for us, PostgreSQL has a Snowball dictionary template ready to use. This template has the Snowball stemming rules embedded for a wide variety of languages. Let us create a dictionary for our shisaa configuration, shall we?

CREATE TEXT SEARCH DICTIONARY shisaa_snowball (template = snowball, language = english);

Again, very easy to set up. The snowball dictionary template accepts two variables. The first, mandatory one is the language you wish to support. Without it, the template does not know which of the Snowball stemming rules to use.

The next, optional one is, again, a stop word list. But...why can we feed this dictionary a stop word list? Did we not already do that with the simple dictionary?

That is correct, we did set up the simple dictionary to remove stop words for us, but we are not required to use the simple and the snowball dictionary in tandem. It is perfectly possible to map only the snowball dictionary for various token categories and ignore all other dictionaries. If you did not tell the snowball dictionary to remove stop words, things could get messy, for the Snowball stemmer will try to stem every word it finds.

This stop word list can be the exact same list we fed the simple dictionary.

Also, because a snowball dictionary will try to parse all the tokens it is fed, it is considered to be a wide dictionary. Therefore, as we have seen earlier, it is a good idea when chaining dictionaries together to put this dictionary at the end of your chain.

We now have our own version of the snowball dictionary and need to extend our configuration and map this dictionary to the desired token categories:

ALTER TEXT SEARCH CONFIGURATION shisaa
  ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
  WITH shisaa_simple, shisaa_snowball;

Notice that in the WITH clause we are now chaining the simple and the snowball dictionary together. The order is, of course, important. Describe our configuration once more:

\dF+ shisaa

And get back:

asciihword      | shisaa_simple,shisaa_snowball
asciiword       | shisaa_simple,shisaa_snowball
hword           | shisaa_simple,shisaa_snowball
hword_asciipart | shisaa_simple,shisaa_snowball
hword_part      | shisaa_simple,shisaa_snowball
word            | shisaa_simple,shisaa_snowball

Perfect, now the simple dictionary will be consulted first followed by the snowball dictionary.

Note that throughout this chapter I will chain together dictionaries in order. This will not always be the smartest or most desirable order, just an order that demonstrates how you can chain dictionaries.

To the test, throw a new query at it:

SELECT to_tsvector('shisaa', 'The big blue elephant jumped over the crippled blue dolphin.');

And get back:

'big':2 'blue':3,9 'crippled':8 'elephant':4 'jumped':5 'over':6

Nice, that is very...oh wait. Something is not correct. I am getting back exactly the same result as before. The words "crippled" and "elephant" are not stemmed at all. Why?

Well, the simple dictionary, as we defined it earlier, is set up to be a bit greedy. In its current state it will return an unmatched token as a lexeme with casing removed. It does not return NULL. And, as you know by now, NULL is needed to give other dictionaries a chance to examine the token.

So, we need to alter the simple dictionary's behavior. For this, we can use the ALTER syntax provided to us. And as it turns out, the simple dictionary template can accept one more variable: the accept variable. If this is set to false, then it will return NULL for every unmatched token. Let us alter that dictionary:

ALTER TEXT SEARCH DICTIONARY shisaa_simple (accept = false);

Run the to_tsvector query again, and look at the results:

'big':2 'blue':3,9 'crippl':8 'eleph':4 'jump':5 'over':6

That is what we were looking for, nicely stemmed results!

Extending the configuration: fun with synonyms

By now we have seen the first and the last dictionary in our chain, but at least one more important part is missing: synonyms are not yet taken into account.

Let us extend our favorite sentence and add a few synonyms to it: "The big blue elephant, joined by its enormous blue mammoth friend, jumped over the crippled blue dolphin while smiling at the orca."

Still perfectly possible.

In the light of (cue dark and deep Batman voice) "science" (end Batman voice), let us first see what we get when we run it through our current configuration:

'at':20 'big':2 'blue':3,9,16 'by':6 'crippl':15 'eleph':4 'enorm':8 'friend':11 'it':7 'join':5 'jump':12 'mammoth':10 'orca':22 'over':13 'smile':19 'while':18

That is one big result set. Maybe we should cut the blue dolphin a little bit of slack and feed a real stop word list to our simple dictionary before continuing by altering our dictionary:

ALTER TEXT SEARCH DICTIONARY shisaa_simple (stopwords = english);

As you see you can simply use the same ALTER syntax as before. The "english" here refers to the shipped "english.stop" stop word list.

Querying again, we will get back a better, shorter list (including our dolphin friend):

'big':2 'blue':3,9,16 'crippl':15 'dolphin':17 'eleph':4 'enorm':8 'friend':11 'join':5 'jump':12 'mammoth':10 'orca':22 'smile':19

Now we would like to reduce this result even further by compacting synonyms into one lexeme.

Enter the synonym dictionary template.

This template requires you to have a so-called "synonym" file: a file containing lists of words with the same meaning. For the sake of learning, let us create our own synonym file. This file has to end with the .syn extension.

Open up your editor again and write out a file called "shisaa_syn.syn" with the following contents:

big enormous
elephant mammoth
dolphin orca

And let us setup the dictionary:

CREATE TEXT SEARCH DICTIONARY shisaa_synonym (template = synonym, synonyms = shisaa_syn);

And add the mapping for it:

ALTER TEXT SEARCH CONFIGURATION shisaa
  ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
  WITH shisaa_simple, shisaa_synonym, shisaa_snowball;

Okay, time to test our big string again and see the results:

'blue':3,9,16 'crippl':15 'enorm':8 'enormous':2 'friend':11 'join':5 'jump':12 'mammoth':4,10 'orca':17,22 'smile':19

Very neat. The words "elephant", "big" and "dolphin" are now removed and only their synonyms are kept. Also notice that both "mammoth" and "orca" have two pointers each, one for every synonym.

But look at the words 'enorm' and 'enormous', why is this happening?

If you look at the pointers, you see that enormous points to the second word in the string, being big, while enorm points to the original enormous word. The reason this is happening is that our synonym dictionary has priority over our snowball one. The synonym dictionary emits a lexeme as a synonym for big, being enormous, simply because we told it to do so in our synonym file. Now, because it emits a lexeme, the original token, big, is not available anymore for the rest of the dictionary chain.

The token enormous itself has no synonym because we did not define it in our synonym file. It is ignored by the synonym dictionary and passed over to the snowball dictionary which then stems the token into a lexeme resulting in enorm.

If you wish to prevent this from happening, you could add a self pointing line to your synonym list:

enormous enormous

Now reload the file from disk to pull the changes into PostgreSQL:

ALTER TEXT SEARCH DICTIONARY shisaa_synonym (synonyms = shisaa_syn);

And run the query again, the result should now read:

'blue':3,9,16 'crippl':15 'enormous':2,8 'friend':11 'join':5 'jump':12 'mammoth':4,10 'orca':17,22 'smile':19

Now enorm will be removed and both big and enormous are cast to the same lexeme.

PostgreSQL does not ship a synonym list, so you will have to compile your own just like we did above, though hopefully a little more useful.

Extending the configuration: phrasing with a Thesaurus

Next up is the thesaurus dictionary, which is quite close to the synonym dictionary, with one exception: phrases.

A thesaurus dictionary is used to recognize phrases and convert them into lexemes with the same meaning. Again, this dictionary relies on a file containing the phrase conversions. This time, the file has the .ths extension.

Open up your editor and write out a file called "shisaa_thesaurus.ths" with the following contents:

big blue elephant : PostgreSQL
crippled blue dolphin : MySQL

Before we can create the dictionary, there is one more required variable we have to set: the subdictionary the thesaurus dictionary can use. This subdictionary will be another dictionary you have defined before. Most of the time a stemmer is fed to this variable to let the thesaurus stem the input before comparing it with its thesaurus file.

So let us feed it our snowball dictionary and set it up:

CREATE TEXT SEARCH DICTIONARY shisaa_thesaurus (TEMPLATE = thesaurus, DICTFILE = shisaa_thesaurus, DICTIONARY = shisaa_snowball);

Map it:

ALTER TEXT SEARCH CONFIGURATION shisaa
  ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
  WITH shisaa_simple, shisaa_thesaurus, shisaa_snowball;

Notice that I took out the synonym dictionary. If we chain up too many dictionaries, the results might turn out to be undesirable in our demonstration use case.

Querying will result in the following tsvector:

'blue':7 'enorm':6 'friend':9 'join':3 'jump':10 'mammoth':8 'mysql':13 'orca':18 'postgresql':2 'smile':15

That is quite awesome, it now recognizes "big blue elephant" as PostgreSQL and "crippled blue dolphin" as MySQL. We have created a pun-aware full text search configuration!

As you can see, both the "MySQL" and "PostgreSQL" lexemes have one pointer each, pointing to the first word of the substring that got converted.

Extending the configuration a last time: morphing with Ispell

Okay, we are almost at the end of the dictionary templates that PostgreSQL supports.

This last one is a fun one too. Many Unix and Linux systems come shipped with a spell checker called Ispell or with the more modern variant called HunSpell. Besides your average spell checking, these dictionaries are very good at morphological lookups, meaning that they can link all different writing structures of words together.

A synonym or thesaurus dictionary would not catch these, unless explicitly set with a huge number of lines in the .syn or .ths files, which is error prone and inelegant. The Ispell or Hunspell dictionaries will capture these and try to make them into one lexeme.

Before setting up the dictionary, we first need to make sure that we have the Ispell or Hunspell dictionary files for the language we wish to support. Normally you would want to download these files from the official OpenOffice page. These pages, however, seem to be confusing and the correct files very hard to find. I have found the following page of great help to get the files you need for your desired language. Download the files for your desired language and place the .dict and the .affix files into the PostgreSQL shared directory.

For now, let us just take the basic English dict and affix files (both named en_us and already shipped with PostgreSQL) and feed them to the configuration:

CREATE TEXT SEARCH DICTIONARY shisaa_ispell (template = ispell, DictFile = en_us, AffFile = en_us, StopWords = english);

And chain it:

ALTER TEXT SEARCH CONFIGURATION shisaa
  ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
  WITH shisaa_simple, shisaa_ispell, shisaa_snowball;

Notice that I again took out the thesaurus dictionary, so as not to pile up too many dictionaries at once.

Query it once more, and look at what we get back:

'big':2 'blue':3,9,16 'cripple':15 'dolphin':17 'elephant':4 'enormous':8 'friend':11 'join':5 'joined':5 'jump':12 'mammoth':10 'orca':22 'smile':19 'smiling':19

Hmm, interesting. Notice that we now got more lexemes than before, smile and smiling for example, and join and joined. Also, both these cases have the same pointer. Why is that?

What is happening here is a feature of the Ispell dictionary called morphology or, as we saw above, morphological lookups. One of the reasons why Ispell is such a powerful dictionary is that it can recognize and act upon the structure of a word.

In our case, Ispell recognizes joined (or smiling) and emits an array of two lexemes, the original token converted to a lexeme and the stemmed version of the token.

This concludes all the dictionaries that PostgreSQL ships with by default and the ones you will most likely ever need. What is next?

Debugging

Now that you have a good understanding of how to build your own configuration and set up your own dictionaries, I would like to introduce a few new functions that can come in handy when your configuration produces seemingly strange results.

ts_debug()

The first function I want to show you is a very handy one that is built to test your whole full text configuration. It helps you keep your mental condition to just mildly insane, so to speak.

The function ts_debug() accepts a configuration and a string of text you wish to test. As a result you will get back a set that contains an overview of how the parser chopped your string into tokens, which category it picked for each token, which dictionary was consulted and which lexeme(s) were emitted. Oh boy, this is too much fun, let us just try it out!

Feed our original pun string and let us test the current shisaa configuration:

SELECT ts_debug('shisaa', 'The big blue elephant jumped over the crippled blue dolphin.');

Hmm, that may not be very readable; rather, use the wildcard selector and a FROM clause to include column names in our result set (one of the few times you may use this selector without getting smacked):

SELECT * FROM ts_debug('shisaa', 'The big blue elephant jumped over the crippled blue dolphin.');

Which will result in the following, huge set:

alias   |   description   |  token   |                 dictionaries                  |  dictionary   |  lexemes   
-----------+-----------------+----------+-----------------------------------------------+---------------+------------
asciiword | Word, all ASCII | The      | {shisaa_simple,shisaa_ispell,shisaa_snowball} | shisaa_simple | {}
blank     | Space symbols   |          | {}                                            |               | 
asciiword | Word, all ASCII | big      | {shisaa_simple,shisaa_ispell,shisaa_snowball} | shisaa_ispell | {big}
blank     | Space symbols   |          | {}                                            |               | 
asciiword | Word, all ASCII | blue     | {shisaa_simple,shisaa_ispell,shisaa_snowball} | shisaa_ispell | {blue}
blank     | Space symbols   |          | {}                                            |               | 
asciiword | Word, all ASCII | elephant | {shisaa_simple,shisaa_ispell,shisaa_snowball} | shisaa_ispell | {elephant}
blank     | Space symbols   |          | {}                                            |               | 
asciiword | Word, all ASCII | jumped   | {shisaa_simple,shisaa_ispell,shisaa_snowball} | shisaa_ispell | {jump}
blank     | Space symbols   |          | {}                                            |               | 
asciiword | Word, all ASCII | over     | {shisaa_simple,shisaa_ispell,shisaa_snowball} | shisaa_simple | {}
blank     | Space symbols   |          | {}                                            |               | 
asciiword | Word, all ASCII | the      | {shisaa_simple,shisaa_ispell,shisaa_snowball} | shisaa_simple | {}
blank     | Space symbols   |          | {}                                            |               | 
asciiword | Word, all ASCII | crippled | {shisaa_simple,shisaa_ispell,shisaa_snowball} | shisaa_ispell | {cripple}
blank     | Space symbols   |          | {}                                            |               | 
asciiword | Word, all ASCII | blue     | {shisaa_simple,shisaa_ispell,shisaa_snowball} | shisaa_ispell | {blue}
blank     | Space symbols   |          | {}                                            |               | 
asciiword | Word, all ASCII | dolphin  | {shisaa_simple,shisaa_ispell,shisaa_snowball} | shisaa_ispell | {dolphin}
blank     | Space symbols   | .        | {}                                            |               |

You now have a complete overview of the flow from string to vector of lexemes. Let me go over some interesting facts of this result set.

First, notice how the tokens the and over got removed by the simple dictionary. They were a hit in the stop word list, so the dictionary returned an empty array.

Next you see the alias blank between each asciiword. Blank is a category used for spaces or punctuation. A space and a . (full stop) are each considered a token, but they are stripped out by the parser itself for they have no value in this context.

And last, see that our snowball dictionary was never consulted. This means that, in this string, shisaa_ispell gobbled up all the tokens that shisaa_simple passed on to it.

ts_lexize()

The second function is ts_lexize(). This little helper lets you test different parts of your whole setup. Take the unexpected result of our last dictionary, where we got back multiple lexemes. As it turned out, this is normal behavior, but you may want to verify that the result is coming from the dictionary and not from a side effect of how you chained your dictionaries together.

To test our single, shisaa_ispell dictionary, we could feed it to this new function, together with one token we wish to test:

SELECT ts_lexize('shisaa_ispell', 'joined');

This will return:

{joined,join}

Same as we had before, but now we know, for sure, that it is a feature of our Ispell dictionary. Notice that I stressed the fact that you can only feed this function one token, not a string of text and not multiple tokens.

You can use this function to test all your dictionaries individually, one token at a time.
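You could, for instance, check the snowball and simple dictionaries from earlier on their own. A hedged sketch; the expected outputs in the comments assume the dictionaries exactly as we defined them above:

-- The snowball stemmer by itself
SELECT ts_lexize('shisaa_snowball', 'crippled');   -- expected: {crippl}

-- The simple dictionary with accept = false returns NULL for anything that is not a stop word
SELECT ts_lexize('shisaa_simple', 'elephant');     -- expected: NULL
SELECT ts_lexize('shisaa_simple', 'the');          -- expected: {}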

Phew, that was a lot to take in for we covered a lot of ground here today. You can turn the lights back high and go get some fresh air. In the next chapter, I will round up this introduction by introducing you to the following, new material:

  • Ranking search results
  • Highlighting words inside search results
  • Creating special, full text search indexes
  • Setting up update triggers for tsvector columns

And as always...thanks for reading!

Craig Kerstiens: Postgres Datatypes – The ones you're not using.


Postgres has a variety of datatypes, in fact quite a few more than most other databases. Most commonly applications take advantage of the standard ones – integers, text, numeric, etc. Almost every application needs these basic types; the rarer ones are needed less frequently. And while not needed in every application, when you do need them they can be extremely handy. So without further ado, let's look at some of these rarer but awesome types.

hstore

Yes, I've talked about this one before, yet still not enough people are using it. Of this list of datatypes this is one that could benefit most, if not all, applications. Hstore is a key-value store directly within Postgres. This means you can easily add new keys and values (optionally), without having to run a migration to set up new columns. Further, you can still get great performance by using GIN and GiST indexes with them, which automatically index all keys and values of the hstore.

It’s of note that hstore is an extension and not enabled by default. If you want the ins and outs of getting hands on with it, give the article on Postgres Guide a read.
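A minimal sketch of what that looks like in practice (the table and column names here are made up for illustration):

-- hstore ships as an extension, so enable it first
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE products (
  id    serial PRIMARY KEY,
  attrs hstore
);

-- Keys can be added freely, no migration needed
INSERT INTO products (attrs)
VALUES ('color => "blue", size => "L"');

-- A GIN index covers all keys and values
CREATE INDEX idx_products_attrs ON products USING gin (attrs);

-- Fetch everything containing a given key/value pair
SELECT * FROM products WHERE attrs @> 'color => "blue"';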

Range types

If there is ever a time where you have two columns in your database with one being a from and another being a to, you probably want to be using range types. Range types are just that: ranges. A super common use of them is when doing anything with calendaring. The place where they really become useful is in their ability to apply constraints on those ranges. This means you can make sure you don't have overlapping time issues, and don't have to rebuild heavy application logic to accomplish it.
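For example, an overlap-preventing constraint on a hypothetical booking table might look roughly like this (the btree_gist extension is assumed for the equality part of the exclusion constraint):

CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE room_bookings (
  room_id       int,
  booked_during tsrange,
  -- No two bookings for the same room may overlap
  EXCLUDE USING gist (room_id WITH =, booked_during WITH &&)
);

INSERT INTO room_bookings VALUES (1, '[2014-05-10 09:00, 2014-05-10 10:00)');
-- This second, overlapping booking will be rejected by the constraint
INSERT INTO room_bookings VALUES (1, '[2014-05-10 09:30, 2014-05-10 11:00)');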

Timestamp with Timezone

Timestamps are annoying, plain and simple. If you've re-invented handling different timezones within your application you've wasted plenty of time and likely done it wrong. If you're using plain timestamps within your application, there's furthermore a good chance they don't even mean what you think they mean. Timestamps with timezone, or timestamptz, automatically include the timezone with the timestamp. This makes it easy to convert between timezones, know exactly what you're dealing with, and will in short save you a ton of time. There's seldom a case where you shouldn't be using these.
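A quick illustration of the idea (table and column names are made up):

CREATE TABLE events (
  happened_at timestamptz  -- stored as an absolute instant, displayed in your session's timezone
);

INSERT INTO events VALUES (now());

-- The same instant, viewed from two timezones
SELECT happened_at AT TIME ZONE 'UTC'             AS utc_time,
       happened_at AT TIME ZONE 'America/Chicago' AS chicago_time
  FROM events;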

UUID

Integers as primary keys aren't great. Sure, if you're running a small blog they work fine, but if your application has to scale to a large size, integers can create problems. First, you can run out of them; second, they can make other details such as sharding a little more annoying. At the same time they are super readable. However, using the actual UUID datatype and the extension to automatically generate them can be incredibly handy if you have to scale an application.

Similar to hstore, there’s an extension that makes the UUID much more useful.
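A minimal sketch, assuming the uuid-ossp extension as the generator (the table is made up for illustration):

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

CREATE TABLE accounts (
  id   uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
  name text
);

INSERT INTO accounts (name) VALUES ('example') RETURNING id;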

Binary JSON

This isn’t available yet, but will be in Postgres 9.4. Binary JSON is of course JSON directly within your database, but also lets you add Gin indexes directly onto JSON. This means a much simpler setup in not only inserting JSON, but having fast reads. If you want to learn a bit more about this, sign up to get notified of training regarding the upcoming PostgreSQL 9.4 release.
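Based on what is planned for 9.4, usage should look roughly like this (hedged, since the release is not out yet at the time of writing, and the table name is made up):

CREATE TABLE documents (
  doc jsonb
);

INSERT INTO documents VALUES ('{"user": "craig", "status": "active"}');

-- A GIN index over the whole document
CREATE INDEX idx_documents_doc ON documents USING gin (doc);

-- Fast containment queries
SELECT * FROM documents WHERE doc @> '{"status": "active"}';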

Money

Please don’t use this… The money datatype assumes a single currency type, and generally brings with it more caveats than simply using a numeric type.

In conclusion

What'd I miss? What are your favorite types? Let me know @craigkerstiens, or sign up below for updates on Postgres content and first access to training.


Jim Mlodgenski: Trigger Overhead


I recently had discussions with some folks about triggers in PostgreSQL. They had two main questions.

  1. What is the overhead of putting a trigger on a table?
  2. Should a trigger function be generic with IF statements to do different things for INSERT, UPDATE and DELETE?

So I created a simple test to verify some assumptions.

First, I created a simple table and made it UNLOGGED. I didn’t want the overhead of the WAL to possibly dwarf the timings of the triggers.

CREATE UNLOGGED TABLE trigger_test (
 key serial primary key, 
 value varchar, 
 insert_ts timestamp, 
 update_ts timestamp
);

I then created two scripts to push through pgbench and get some timings.

INSERTS.pgbench

INSERT INTO trigger_test (value) VALUES ('hello');

UPDATES.pgbench

\set keys :scale
\setrandom key 1 :keys
UPDATE trigger_test SET value = 'HELLO' WHERE key = :key;

I ran these with the following pgbench commands:
pgbench -n -t 100000 -f INSERTS.pgbench postgres
pgbench -n -s 100000 -t 10000 -f UPDATES.pgbench postgres

The result is that I created 100,000 rows in the test table and then randomly updated 10,000 of them. I ran these commands several times with dropping and recreating the test table between each iteration and the average tps values I was seeing were:
Inserts: 4510 tps
Updates: 4349 tps
Then to get the overhead of a trigger, I created a trigger function that just returns. I then repeated the process of running the pgbench commands as before.

CREATE FUNCTION empty_trigger() RETURNS trigger AS $$
BEGIN
 RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER empty_trigger BEFORE INSERT OR UPDATE ON trigger_test
 FOR EACH ROW EXECUTE PROCEDURE empty_trigger();

The results with the empty trigger were:
Inserts: 4296 tps
Updates: 3988 tps

That results in a 4.8% overhead for inserts and an 8.3% overhead for updates. I didn't dig further as to why it appears that the overhead for a trigger on an update is almost twice as high as on an insert. I'll leave that for a follow-up when I have some more time. While a 4%-8% overhead of placing a trigger on a table will likely not be noticed in most real-world applications, the overhead of what is executed inside the trigger function can be, which leads to the next topic.
I then wanted to see the overhead of having a single trigger function versus having separate trigger functions for inserts and updates.

For a single trigger function, I used the following:

CREATE FUNCTION single_trigger() RETURNS trigger AS $$
BEGIN
 IF (TG_OP = 'INSERT') THEN
 NEW.insert_ts = CURRENT_TIMESTAMP;
 RETURN NEW;
 ELSIF (TG_OP = 'UPDATE') THEN
 NEW.update_ts = CURRENT_TIMESTAMP;
 RETURN NEW;
 END IF;
 RETURN NULL; 
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER single_trigger BEFORE INSERT OR UPDATE ON trigger_test
 FOR EACH ROW EXECUTE PROCEDURE single_trigger();

And for separate trigger functions, I used:

CREATE FUNCTION insert_trigger() RETURNS trigger AS $$
BEGIN
 NEW.insert_ts = CURRENT_TIMESTAMP;
 RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE FUNCTION update_trigger() RETURNS trigger AS $$
BEGIN
 NEW.update_ts = CURRENT_TIMESTAMP;
 RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER insert_trigger BEFORE INSERT ON trigger_test
 FOR EACH ROW EXECUTE PROCEDURE insert_trigger();
CREATE TRIGGER update_trigger BEFORE UPDATE ON trigger_test
 FOR EACH ROW EXECUTE PROCEDURE update_trigger();

I then reran the same process as before to see the overhead.

Single Trigger Inserts: 3569 tps
Single Trigger Updates: 3450 tps

Separate Triggers Inserts: 3623 tps
Separate Triggers Updates: 3870 tps

It turns out that splitting the trigger function into separate functions does make a difference. For the insert trigger, keeping things as a single trigger only added 1.5% to the overhead, but for the update trigger, a single trigger function added nearly 11% overhead. This is most likely due to the update case being handled second in the trigger function. That's another thing to dig into when there is time.

 

Josh Berkus: Why you should always set temp_file_limit

"The database is growing at 2GB a minute.  We're 40 minutes away from running out of disk space."

"Sounds like I should probably take a look."

I looked at the database size, which was 160GB.  But the database SAN share was up to 1.4TB used out of 1.6TB.  WTF?

Then I looked at the filesystem and did directory sizes.  pg_tmp was over a terabyte.  Oooooooohhh.

Apparently they'd accidentally pushed a new report to the application which worked OK in testing, but with certain parameters created a 15 billion item sort.  And since it was slow, users called it several times.  Ooops.

Enter temp_file_limit, a parameter added by Mark Kirkwood to PostgreSQL 9.2.   This is a limit on per-session usage of temporary files for sorts, hashes, and similar operations.  If a user goes over the limit, their query gets cancelled and they see an error.

This is an excellent way to prevent a single user, or a bad application change, from DOSing your database server.  Set it to something high; I'm using 10GB or 20GB, or 10% of available disk space, whichever is less.  But even a high limit like that will save you from some unexpected downtime.
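Setting it is a one-liner; a sketch using the values mentioned above (temp_file_limit can only be changed by superusers, and reporting_user is just a hypothetical role name):

-- In postgresql.conf:  temp_file_limit = '20GB'
-- or, for the current (superuser) session:
SET temp_file_limit = '20GB';
-- or, pinned to a specific role:
ALTER ROLE reporting_user SET temp_file_limit = '10GB';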



robert berry: Querying Time Series in Postgresql


Querying Time Series in Postgresql

May 8, 2014 – Portland, OR

This post covers some of the features which make Postgresql a fun and effective database system for storing and analyzing time series: date functions, window functions, and series generating functions.

What are time series?

In a computation context, a time series is a sequence of measurements taken at discrete time intervals. Measurements can be taken every second, hour, day, or other arbitrary interval. The code below will generate some time series data on the activity in a Postgresql cluster by sampling pg_stat_activity every 10 seconds for 100 seconds.

-- add a table to store some time series data
create table activity_tseries (measured_at timestamptz, activity_count int);

-- add a pl function to collect counts of active queries at 10 second intervals
create or replace function collect_activity() returns void AS $$
  begin
    for i in 1..10 loop
      insert into activity_tseries (measured_at, activity_count) values 
        (clock_timestamp(), (select count(*) from pg_stat_activity where state <> 'idle'));
      perform pg_sleep(10);
    end loop;
  end;
$$ language plpgsql;

-- collect the data
select collect_activity();

Consider the output, a sequence of activity measurements at 10 second intervals. Note the selection of a timestamp and an associated metric.

postgres=# select measured_at, activity_count from activity_tseries;
          measured_at          | activity_count
-------------------------------+----------------
 2014-05-07 18:22:09.655861-07 |             11
 2014-05-07 18:22:19.664114-07 |             10
 2014-05-07 18:22:29.674501-07 |              3
 2014-05-07 18:22:39.676574-07 |              9
 2014-05-07 18:22:49.686977-07 |              5
 2014-05-07 18:22:59.697342-07 |              6
 2014-05-07 18:23:09.707722-07 |              4
 2014-05-07 18:23:19.70827-07  |              6
 2014-05-07 18:23:29.718338-07 |              4
 2014-05-07 18:23:39.719099-07 |              2
(10 rows)

Continuous, Discrete, and Granularity — Changes Over Time Domain

It’s kind of difficult to look at a large number of data points and attribute meaning. So it’s common to ask questions like “how many activities were observed by hour?” Maybe that would highlight a trend or pattern in activity. For this illustrative example we will consider answering the question how many activities were observed per minute.

In a database time series data is always similar to the above in that it has discrete intervals — it’s integers all the way down. While there are some rules that govern accuracy when changing the time domain, we’ll be safely changing the time domain to buckets of 1 minute.

This is where date_trunc comes into the picture. date_trunc is arguably the most important time function for this type of query. Used with a sum aggregate we can easily count activities by minutes.

postgres=# select date_trunc('minute', measured_at) as mins, sum(activity_count)
              from activity_tseries
              group by date_trunc('minute', measured_at)
              order by date_trunc('minute', measured_at) asc;

          mins          | sum 
------------------------+-----
 2014-05-07 18:22:00-07 |  12
 2014-05-07 18:23:00-07 |   8

Representations of a theoretical continuous sample, the actual samples, and the aggregated samples from this section are pictured below.

Quick Practical Example

Before moving on to some more interesting features, here's a quick example that will be widely applicable and that will set the stage for the next section. Let's consider counting the number of new user records by day based on a created_at timestamp. These are some actual numbers from The Drift, in case you are curious about the adoption of a free Android application in its first months:

postgres=# select date_trunc('day', created_at) as day, count(*)
              from users
              group by date_trunc('day', created_at)
              order by date_trunc('day', created_at) asc;

     date_trunc      | count 
---------------------+-------
 2014-01-05 00:00:00 |     1
 2014-01-13 00:00:00 |     1
 2014-01-14 00:00:00 |     1
 2014-01-15 00:00:00 |     1
 2014-01-16 00:00:00 |     1
 2014-02-10 00:00:00 |     1
 2014-02-13 00:00:00 |     1
 2014-02-14 00:00:00 |     2
 2014-02-18 00:00:00 |     1
 2014-02-19 00:00:00 |     2
 2014-02-21 00:00:00 |     1
 2014-02-23 00:00:00 |     2
 2014-02-24 00:00:00 |     1
 2014-03-02 00:00:00 |     1
 2014-03-03 00:00:00 |     1
 2014-03-06 00:00:00 |     3
 2014-03-09 00:00:00 |     2
...

Interval Filling

Let's say we wanted to reason about the rate of adoption from the above result set, or plot this data in a simple plotting library. We might have a problem. There are numerous gaps in the data where there were no results, e.g. between January 5th and January 13th. The library may not support parsing date strings and managing the time axis properly.

A straightforward technique to solve this problem is to outer join a result set from the generate_series function.

postgres=# with filled_dates as (
  select day, 0 as blank_count from
    generate_series('2014-01-01 00:00'::timestamptz, current_date::timestamptz, '1 day') 
      as day
),
signup_counts as (
  select date_trunc('day', created_at) as day, count(*) as signups
    from users
  group by date_trunc('day', created_at)
)
select filled_dates.day, 
       coalesce(signup_counts.signups, filled_dates.blank_count) as signups
  from filled_dates
    left outer join signup_counts on signup_counts.day = filled_dates.day
  order by filled_dates.day;

          day           | signups 
------------------------+---------
 2014-01-01 00:00:00-07 |       0
 2014-01-02 00:00:00-07 |       0
 2014-01-03 00:00:00-07 |       0
 2014-01-04 00:00:00-07 |       0
 2014-01-05 00:00:00-07 |       1
 2014-01-06 00:00:00-07 |       0
 2014-01-07 00:00:00-07 |       0
 2014-01-08 00:00:00-07 |       0
 2014-01-09 00:00:00-07 |       0
 2014-01-10 00:00:00-07 |       0
 2014-01-11 00:00:00-07 |       0
 2014-01-12 00:00:00-07 |       0
 2014-01-13 00:00:00-07 |       1
 2014-01-14 00:00:00-07 |       1
 2014-01-15 00:00:00-07 |       1
 2014-01-16 00:00:00-07 |       1
 2014-01-17 00:00:00-07 |       0

Finite Difference (Discrete Derivative)

Here’s an interesting case for estimating a time series for the transactions/second being processed by a Postgresql cluster. This is a real problem that came up when building Relational Systems.

We start by collecting familiar time series data into a metrics table. In this case we collect a timestamp associated with the result of the txid_current function.

          measured_at          | current_tx_id 
-------------------------------+---------------
 2014-05-03 13:20:46.797304-07 |       1732896
 2014-05-03 13:21:05.012321-07 |       1732923
 2014-05-03 13:21:20.05257-07  |       1732945
 2014-05-03 13:21:35.069332-07 |       1732962
 2014-05-03 13:21:50.102453-07 |       1732991
 2014-05-03 13:22:05.127961-07 |       1733002
 2014-05-03 13:22:20.162577-07 |       1733023
 2014-05-03 13:22:35.189161-07 |       1733034
 2014-05-03 13:22:50.21059-07  |       1733056
 2014-05-03 13:23:20.319999-07 |       1733070
 2014-05-03 13:23:47.909198-07 |       1734933

Ignoring wraparound cases for simplicity, the goal is to query this data for a result set which represents the familiar formula:

$\frac{\text{txid\_current}_{n} - \text{txid\_current}_{n-1}}{\Delta t_{sec}} \approx \frac{\text{transactions}}{\text{second}}$

And, fortunately, this is simple with the amazing feature that is window functions.

postgres=# select measured_at,
              (current_tx_id - coalesce(lag(current_tx_id, 1) over w, current_tx_id)) / 
                extract( epoch from  (measured_at - lag(measured_at, 1) over w))::numeric 
                  as tx_sec 
            from heartbeats
              window w as (order by measured_at) 
            order by measured_at desc;

          measured_at          |         tx_sec         
-------------------------------+------------------------
 2014-05-04 13:02:56.456229-07 |     1.4642749200123185
 2014-05-04 13:02:41.431728-07 | 0.79921937579501514887
 2014-05-04 13:02:26.417077-07 |     1.4637795079651562
 2014-05-04 13:02:11.387491-07 | 0.79917248353238499975
 2014-05-04 13:01:56.371959-07 |     1.4634836872125585
 2014-05-04 13:01:41.339335-07 | 0.79888332089405429091
 2014-05-04 13:01:26.318368-07 |     1.4639291192137370
 2014-05-04 13:01:11.290318-07 | 0.79880897581705676836
 2014-05-04 13:00:56.267953-07 |    26.7313182385336655
 2014-05-04 13:00:40.743092-07 | 0.49860550013721623364
 2014-05-04 13:00:10.659188-07 |     2.3291032160723028

Window functions allow you to reference records in a window, which is a set of records which have a relationship to the current row. In the example above, the query uses the lag function which returns a value from a row offset before the current row. The window for this query is ordered by the timestamp measured at. Because a lag of 1 references the previous sample of txid_current() the tx_sec field matches the desired formula.

Window functions are remarkably powerful as you can apply aggregates over windows.
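For instance, a simple moving average over the activity series collected earlier could be sketched as follows (the 3-row window size is an arbitrary choice):

-- A 3-sample moving average of the activity counts collected above
select measured_at,
       avg(activity_count) over (order by measured_at
                                 rows between 2 preceding and current row) as moving_avg
  from activity_tseries
 order by measured_at;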

Below is a plot based on the transactions-per-second query showing what happens when running pgbench -T 100 -c 20 on a commodity desktop.

Final Remarks

The examples illustrate how easy and effective Postgresql can be for querying time series data with date functions, window functions, and series generating functions. There are many more great tools for this type of querying that I hope to explore in future posts.

Shaun M. Thomas: Trumping the PostgreSQL Query Planner


With the release of PostgreSQL 8.4, the community gained the ability to use CTE syntax. As such, this is a fairly old feature, yet it’s still misunderstood in a lot of ways. At the same time, the query planner has been advancing incrementally since that time. Most recently, PostgreSQL has gained the ability to perform index-only scans, making it possible to fetch results straight from the index, without confirming rows with the table data.

Unfortunately, this still isn’t enough. There are still quite a few areas where the PostgreSQL query planner is extremely naive, despite the advances we’ve seen recently. For instance, PostgreSQL still can’t do a basic loose index scan natively. It has to be tricked by using CTE syntax.

To demonstrate this further, imagine this relatively common scenario: an order processing system where clients can order products. What happens when we want to find the most recent order for all current customers? Boiled down to its minimum elements, this extremely simplified table will act as our order system.

CREATE TABLE test_order
(
  client_id   INT        NOT NULL,
  order_date  TIMESTAMP  NOT NULL,
  filler      TEXT       NOT NULL
);

Now we need data to test with. We can simulate a relatively old order processing system by taking the current date and subtracting 1,000 days. We can also bootstrap with 10,000 clients, and make the assumption that newer clients will be more active. This allows us to represent clients that have left our services as time goes on. So we start with this test data:

INSERT INTO test_order
SELECT s1.id,
       (CURRENT_DATE - INTERVAL '1000 days')::DATE 
           + generate_series(1, s1.id%1000),
       repeat(' ', 20)
  FROM generate_series(1, 10000) s1 (id);

The generate_series function is very handy for building fake data. We’re still not ready to use that data, however. Since we want to find the most recent order for all customers, we need an index that will combine the client_id and order_date columns in such a way that a single lookup will provide the value we want for any particular client. This index should do nicely:

CREATE INDEX idx_test_order_client_id_order_date
    ON test_order (client_id, order_date DESC);

Finally, we analyze to make sure the PostgreSQL engine has the most recent stats for our table. Just to make everything easily repeatable, we also set the default_statistics_target to a higher value than default as well.

SET default_statistics_target TO 500;
ANALYZE test_order;

Now we’ll start with the most obvious query. Here, we just use the client_id column and look for the max order_date for each:

EXPLAIN ANALYZE
SELECT client_id, max(order_date)
  FROM test_order
 GROUP BY client_id;

The query plan is fairly straightforward, and will probably include a sequence scan. On the virtual server we’re testing with, the total runtime for us ended up looking like this:

Total runtime: 1117.408 ms

There is some variance, but the end result is just over one second per execution. We ran this query several times to ensure it was properly cached by PostgreSQL. Why didn’t the planner use the index we created? Let’s assume the planner doesn’t know what max does, and treats it like any other function. With that in mind, we can exploit a different type of syntax that should make the index much more usable. So let’s try DISTINCT ON with an explicit ORDER clause that matches the definition of our index:

EXPLAIN ANALYZE
SELECT DISTINCT ON (client_id) client_id, order_date
  FROM test_order
 ORDER BY client_id, order_date DESC;

Well, this time our test system used an index-only scan, and produced the results somewhat faster. Our new runtime looks like this:

Total runtime: 923.300 ms

That’s almost 20% faster than the sequence scan. Depending on how much bigger the table is than the index, reading the index and producing these results can vary significantly. And while the query time improved, it’s still pretty bad. For systems with tens or hundreds of millions of orders, the performance of this query will continue to degrade along with the row count. We’re also not really using the index effectively.

Reading the index from top to bottom and pulling out the desired results is faster than reading the whole table. But why should we do that? Due to the way we built this index, the root node for each client should always represent the value we’re looking for. So why doesn’t the planner simply perform a shallow index scan along the root nodes? It doesn’t matter what the reason is, because we can force it to do so. This is going to be ugly, but this query will act just as we described:

EXPLAIN ANALYZE
WITH RECURSIVE skip AS
(
  (SELECT client_id, order_date
    FROM test_order
   ORDER BY client_id, order_date DESC
   LIMIT 1)
  UNION ALL
  (SELECT (SELECT min(client_id)
             FROM test_order
            WHERE client_id > skip.client_id
          ) AS client_id,
          (SELECT max(order_date)
             FROM test_order
            WHERE client_id = (
                    SELECT min(client_id)
                      FROM test_order
                     WHERE client_id > skip.client_id
                  )
          ) AS order_date
    FROM skip
   WHERE skip.client_id IS NOT NULL)
)
SELECT *
  FROM skip;

The query plan for this is extremely convoluted, and we’re not even going to try to explain what it’s doing. But the final query execution time is hard to discount:

Total runtime: 181.501 ms

So what happened here? How can the abusive and ugly CTE above outwit the PostgreSQL query planner? We use the same principle as described in the PostgreSQL wiki for loose index scans. We start with the desired maximum order date for a single client_id, then recursively begin adding clients one by one until the index is exhausted. Due to limitations preventing us from using the recursive element in a sub-query, we have to use the SELECT clause to get the next client ID and the associated order date for that client.

This technique works universally for performing sparse index scans, and actually improves as cardinality (the number of unique values) decreases. As unlikely as that sounds, since we are only using the root nodes within the index tree, performance increases when there are fewer root nodes to check. This is the exact opposite of how indexes are normally used, so we can see why PostgreSQL doesn’t natively integrate this technique. Yet we would like to see it added eventually so query authors can use the first query example we wrote, instead of the excessively unintuitive version that actually produced good performance.

In any case, all PostgreSQL DBAs owe it to themselves and their clusters to learn CTEs. They provide a powerful override for the query planner, and help solve the edge cases it doesn’t yet handle.


Marko Tiikkaja: PostgreSQL gotcha of the week, week 19

local:marko=# create table foo();
CREATE TABLE
local:marko=#* insert into foo default values;
INSERT 0 1
local:marko=#* create function getdata(foo) returns table (a int, b int) as $$
begin
  if random() < 0.5 then
    a := 1; b := 1;
    return next;
  else
    a := 2; b := 2;
    return next;
  end if;
end $$ language plpgsql;
CREATE FUNCTION
local:marko=#* select (getdata(foo)).* from foo;
 a |

gabrielle roth: PDXPUG: May meeting next week


When: 7-9pm Thu May 15, 2014
Where: Iovation
Who: Selena Deckelmann
What: The Final Crontab

Crontabber is a new open source utility that makes cron jobs automatically retriable, uses Postgres to store useful information like duration and failure reasons, and integrates easily with Nagios. Come hear about the reasons Mozilla created this tool, and how it’s helped us make our environment more stable and reliable, and less prone to getting calls on the weekend.

Selena Deckelmann is a major contributor to PostgreSQL and a data architect at Mozilla. She’s been involved with free and open source software since 1995 and began running conferences for PostgreSQL in 2007. In 2012, she founded PyLadiesPDX, a portland chapter of PyLadies. She founded Open Source Bridge, Postgres Open and speaks internationally about open source, databases and community. You can find her on twitter (@selenamarie) and on her blog. She also keeps chickens and gives a lot of technical talks.

We have a new meeting location while the Iovation offices are being renovated. We’re still in the US Bancorp Tower at 111 SW 5th (5th & Oak), but on the ground floor in the Training Room. As you face the bank of elevators from the main lobby, take a good deep breath of drywall dust and head to your right to a hallway. Follow the hallway as it turns to the left. The Training Room is the 2nd door on your right and is labeled “Training Room”. There is no room number.

The building is on the Green & Yellow Max lines. Underground bike parking is available in the parking garage; outdoors all around the block in the usual spots.

See you there!


Hans-Juergen Schoenig: Casting integer to IP

Once in a while you have to juggle around with IP addresses and store them / process them in an efficient way. To do so PostgreSQL provides us with two data types: cidr and inet. The beauty here is that those two types make sure that no bad data can be inserted into the database: […]
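The excerpt above cuts off early, so purely as a taste of what these types give you (a sketch, not necessarily the approach taken in the full post): inet rejects malformed addresses outright, and plain inet arithmetic can turn a 32-bit number into an IPv4 address.

-- this fails with an "invalid input syntax for type inet" error:
-- SELECT '300.1.2.3'::inet;

-- offsetting from 0.0.0.0 converts an integer into an address
SELECT '0.0.0.0'::inet + 3232235777;   -- 192.168.1.1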

Josh Berkus: Remastering without restarting

Thanks to Streaming-Only Remastering, PostgreSQL 9.3 has been a boon to high-availability setups and maintenance. You can re-arrange your replicas however you like; remastered, in a tree, in a ring, whatever.  However, there's been one wart on this free reconfiguration of Postgres replication clusters: if you want to change masters, you have to restart the replica.

This doesn't sound like a big deal, until you think about a cluster with load balancing to 16 read-only replicas.  Every one of those you restart breaks a bunch of application connections.   Looking at how timeline switch works, it didn't seem like there was even a good reason for this; really, the only thing which seemed to be blocking it was that primary_conninfo comes from recovery.conf, which only gets read on startup.  I'd hoped that the merger of recovery.conf and postgresql.conf would solve this, but that patch got punted to 9.5 due to conflicts with SET PERSISTENT.

So, I set out to find a workaround, as well as proving that it was only the deficiencies of recovery.conf preventing us from doing no-restart remastering.  And I found one, thanks to the question someone asked me at pgDay NYC.

So, in the diagram above, M1 is the current master.  M2 is a replica which is the designated failover target.  R1 and R2 are additional replicas.  "proxy" is a simple TCP proxy; in fact, I used a python proxy written in 100 lines of code for this test.  You can't use a Postgres proxy like pgBouncer because it won't accept a replication connection.
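For reference, the recovery.conf on R1 and R2 in such a setup points at the proxy instead of at any particular master; a minimal sketch (the host and port match the log output further down, the user name is made up for illustration):

standby_mode = 'on'
primary_conninfo = 'host=172.31.11.254 port=9999 user=replicator'
recovery_target_timeline = 'latest'

The 'latest' timeline target is what lets the replicas follow the new master's timeline after the promotion.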

Remastering time!
  1. Shut down M1
  2. Promote M2
  3. Restart the proxy, now pointing to M2
And the new configuration:

 
But: what happened to R1 and R2?  Did they remaster without restarting?  Yes, indeedy, they did!

LOG:  entering standby mode
LOG:  redo starts at 0/21000028
LOG:  consistent recovery state reached at 0/210000F0
LOG:  database system is ready to accept read only connections
LOG:  started streaming WAL from primary at 0/22000000 on timeline 6
LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 6 at 0/2219C1F0.
FATAL:  could not send end-of-streaming message to primary: no COPY in progress
FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "172.31.11.254" and accepting
                TCP/IP connections on port 9999?

LOG:  fetching timeline history file for timeline 7 from primary server
LOG:  started streaming WAL from primary at 0/22000000 on timeline 6
LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 6 at 0/2219C1F0.
LOG:  new target timeline is 7
LOG:  restarted WAL streaming at 0/22000000 on timeline 7

Not only does this provide us a new remastering workaround for high-availability configurations on 9.3, it also shows us that as soon as we get around to merging recovery.conf with postgresql.conf, restarting to remaster can be eliminated.

Michael Paquier: Make Postgres sing with MinGW on Windows


The community usually lacks developers on Windows who are able to test and provide feedback on patches being implemented. Actually, seeing bug reports from users on Windows on a daily basis (not meaning that each report is really a bug occurring only on this platform, but that there are many users of it), having more people in the field would be great.

Doing development on a different platform usually means a few things:

  • Compiling code with a patch manually and checking if a feature is really working as expected on the platform. This would be a more natural process than waiting for the buildfarm to become red with a Windows-only failure.
  • Helping people with their new features. A patch could be rejected because it proposes something that is not cross-platform.
  • Getting new buildfarm machines publishing results that developers could use to stabilize code properly.

Either way, jumping from one development platform to another (especially with years of experience already doing Linux/Unix things) may not be that straightforward, and can even be quite painful if the platform has its code closed, as you may not be able to do everything you want with it to satisfy your needs.

PostgreSQL has provided many ways to compile its code on Windows for quite a long time now, involving for example the Windows SDK and even Visual C++ (MSVC), but also another, more hacky way using MinGW. This may even be an easier way to enter the world of Windows development if you are used to Unix-like OSes, as you do not need to rely on an externally-developed SDK and can still type commands like ./configure and make yourself.

First of all, once you have your hands on a Windows box, perhaps the best thing to do is to install msysGit, which really helps to provide an experience of Git on Windows close to what you get on Unix/Linux platforms. The console provided is not perfect (you need to use Edit->[Mark|Paste] instead of a plain Ctrl-[C|V] for any copy-paste operation), but it is better than the native Command Prompt of Windows if you are not used to it. Also, one thing not to forget is that disk paths are not prefixed with "C:\" but with "/c/".

Then, continue with the installation of MinGW. Simply download it and then install it in a custom folder. Something like 7-zip is helpful to extract the content from tarballs. You may as well consider another option to get 64-bit binaries, for example MinGW-w64. Some extra instructions on how to use it are available here.

Even after deploying a MinGW build, you may need a proper make command, as it may not be available in what you downloaded (that's actually what I noticed with a MinGW-w64 build). It can, for example, be taken from one of the stable snapshots of MinGW after renaming what is present there properly (the make commands are renamed so as not to conflict with MSYS).

Note as well that the Postgres wiki has some additional notes you may find helpful.

It is usually advised to deploy MinGW in a path like "C:\mingw", but this is up to you as long as its binary folder is included in PATH, which with msysGit for example means:

export PATH=$PATH:/c/mingw/bin

Once that is done, fetch the code from the Postgres git repository and begin the real work. Here are a couple of things to know when starting out.

First, the configure command should enforce a couple of environment variables to make the build work smoothly with msysGit. Also, a value should be provided to "--host" to be able to detect the compiler shipped with MinGW. This results in a configure command similar to this:

PERL=perl \
    BISON=bison \
    FLEX=flex \
    MKDIR_P="mkdir -p" \
    configure --host=x86_64-w64-mingw32 --without-zlib

Note as well that zlib is disabled for simplicity.

Once compilation has been done, be sure to also change the calls of "$(SHELL)" to "bash" in src/Makefile.global, like this:

sed -i "s#\$(SHELL)#bash#g" src/Makefile.global

All those things do not actually require to modify Postgres core code, so it is up to you to modify your build scripts depending on your needs.

There are as well a couple of things to be aware of if you try to port scripts that you have been using in other environments (some home-made scripts needed patches in my case):

  • Be sure to update any newline of the type "\n" with "\r\n", this will avoid parsing failures when inserting multiple lines at the same time inside a single file like pg_hba.conf.
  • USER is not a valid environment variable, USERNAME is.
  • Servers cannot be started by users having Administrator privileges.
  • Compilation is slow...

Once compilation works correctly, you will be able to get something like this:

=# SELECT substring(version(), 1, 73);
                                 substring
---------------------------------------------------------------------------
 PostgreSQL 9.4devel on x86_64-w64-mingw32, compiled by x86_64-w64-mingw32
(1 row)

Then you can congratulate yourself and enjoy a glass of wine.

Pavel Stehule: A speed of PL languages for atypical usage


The typical usage of PL languages is as a glue for SQL statements. But sometimes it can be useful to use these languages to enhance PostgreSQL's built-in library.
I tested a simple variadic function - a reimplementation of the function "least" - that I can compare with the native C implementation (built-in). I was a little bit surprised by the speed of Lua - it is really fast, only one order of magnitude slower than the C implementation. PL/pgSQL is not bad either - it is slower than PL/Lua, but only by a factor of two (which still makes it a very fast SQL glue).

-- native implementation
postgres=# select count(*) filter (where a = least(a,b,c,d,e)) from foo;
count
───────
20634
(1 row)

Time: 55.776 ms
Table foo has about 100K rows.
create table foo(a int, b int, c int, d int, e int);
insert into foo select random()*100, random()*100, random()*100, random()*100, random()*100 from generate_series(1,100000);

postgres=# select count(*) from foo;
count
────────
100000
(1 row)

Time: 21.305 ms
I started with PL/pgSQL
CREATE OR REPLACE FUNCTION public.myleast1(VARIADIC integer[])
RETURNS integer LANGUAGE plpgsql IMMUTABLE STRICT
AS $function$
declare
  result int;
  a int;
begin
  foreach a in array $1
  loop
    if result is null then
      result := a;
    elseif a < result then
      result := a;
    end if;
  end loop;
  return result;
end;
$function$

postgres=# select count(*) filter (where a = myleast1(a,b,c,d,e)) from foo;
count
───────
20634
(1 row)

Time: 996.684 ms
With a small optimization (possible because the result is not a varlena type) it is about 3% faster
CREATE OR REPLACE FUNCTION public.myleast1a(VARIADIC integer[])
RETURNS integer LANGUAGE plpgsql IMMUTABLE STRICT
AS $function$
declare
  result int;
  a int;
begin
  foreach a in array $1
  loop
    if a < result then
      result := a;
    else
      result := coalesce(result, a);
    end if;
  end loop;
  return result;
end;
$function$

postgres=# select count(*) filter (where a = myleast1a(a,b,c,d,e)) from foo;
count
───────
20634
(1 row)

Time: 968.769 ms
Wrapping SQL in PL/pgSQL doesn't help
CREATE OR REPLACE FUNCTION public.myleast2(VARIADIC integer[])
RETURNS integer LANGUAGE plpgsql IMMUTABLE STRICT
AS $function$
declare
  result int;
  a int;
begin
  return (select min(v) from unnest($1) g(v));
end;
$function$

postgres=# select count(*) filter (where a = myleast2(a,b,c,d,e)) from foo;
count
───────
20634
(1 row)

Time: 1886.462 ms
A single-line SQL function is not faster than PL/pgSQL - the body of the SQL function is not trivial, and Postgres cannot inline the function body effectively
CREATE OR REPLACE FUNCTION public.myleast3(VARIADIC integer[])
RETURNS integer LANGUAGE sql IMMUTABLE STRICT
AS $function$select min(v) from unnest($1) g(v)$function$

postgres=# select count(*) filter (where a = myleast3(a,b,c,d,e)) from foo;
count
───────
20634
(1 row)

Time: 1238.185 ms
The winner of this test is the implementation in PL/Lua - the code is readable and pretty fast.
CREATE OR REPLACE FUNCTION public.myleast4(VARIADIC a integer[])
RETURNS integer LANGUAGE pllua IMMUTABLE STRICT
AS $function$
  local result;
  for k,v in pairs(a) do
    if result == nil then
      result = v
    elseif v < result then
      result = v
    end;
  end
  return result;
$function$

postgres=# select count(*) filter (where a = myleast4(a,b,c,d,e)) from foo;
count
───────
20634
(1 row)

Time: 469.174 ms
By contrast, I was surprised by the slower speed of PL/Perl (and writing the code was a little bit more difficult). I have sometimes used Perl for similar small functions, and it looks like Lua is better than Perl for these purposes.
CREATE OR REPLACE FUNCTION public.myleast5(VARIADIC integer[])
RETURNS integer LANGUAGE plperl IMMUTABLE STRICT
AS $function$
  for my $value (@{$_[0]}) {
    if (! defined $result) {
      $result = $value;
    } elsif ( $value < $result ) {
      $result = $value;
    }
  }
  return $result;
$function$

postgres=# select count(*) filter (where a = myleast5(a,b,c,d,e)) from foo;
count
───────
535
(1 row)

Time: 1591.802 ms
PL/Lua is not a well-known development environment, although for use cases like this it looks like the second-best candidate after the C language. Perl is still best for other special use cases due to the practically unlimited choice of extensions available on CPAN.

Second note: this synthetic benchmark is not entirely fair - in a really typical use case the real bottleneck is IO, and the speed of functions like these should not be significant. Databases of this class, like Postgres, MSSQL and Oracle, are heavily optimized to minimize IO operations; numeric calculations are a secondary target. Any IO waits would probably wipe out the differences between these implementations.

Hubert 'depesz' Lubaczewski: Joining BTree and GIN/GiST indexes

Today, I'd like to show you how you can use the same index for two different types of conditions. One that is using normal BTree indexing ( equal, less than, greater than ), and one that is using GIN/GiST index, for full text searching. Before I will go any further – I will be showing […]
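The excerpt stops short of the details, but one common way to get there (a sketch using the btree_gin extension, with made-up table and column names; the full post may take a different route) is a single multicolumn GIN index covering both a scalar column and the full text expression:

CREATE EXTENSION IF NOT EXISTS btree_gin;

CREATE TABLE posts (author_id int, body text);

CREATE INDEX idx_posts_author_fts
    ON posts USING gin (author_id, to_tsvector('english', body));

-- both the BTree-style equality condition and the full text condition
-- can now be answered from the same index
SELECT *
  FROM posts
 WHERE author_id = 42
   AND to_tsvector('english', body) @@ to_tsquery('english', 'index');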

Josh Berkus: cstore_fdw and big data

About a month ago, PostgreSQL fork vendor and Data Warehousing company CitusDB announced the availability of the open-source cstore_fdw.  This foreign data wrapper creates an external table with highly compressed data, which allows you to keep large amounts of archival data on your PostgreSQL server.

You can find out more if you want to tune in tomorrow night, May 13th, around 7:15PM PDT.  Tomorrow night's event will be sponsored by CitusDB and hosted by Rackspace.

First, the good stuff: compression:

phc=# select pg_size_pretty(pg_total_relation_size('postgres_log'));
 pg_size_pretty
----------------
 28 GB 


ls -lh $PGDATA/base

-rw------- 1 postgres postgres 3.2G May 12 13:37 pglog.cstore
-rw------- 1 postgres postgres  12K May 12 13:37 pglog.cstore.footer


So, the "postgres_log" table from this Performance Health Check database, which has 15 million records, takes up 28GB in postgres, and 3.2GB as a cstore table ... a space savings of about 89%.  Not bad.  Especially if you consider that the cstore table already has skip indexes on all indexable columns.

Now, where this space savings becomes a real benefit is if the cstore table fits in memory and the Postgres table doesn't.   I don't have a case like that, although the cstore still does show performance benefits if you have a wide table and you don't need all columns:

phc=# select count(1) from postgres_log where command_tag = 'UPDATE';
 count 
--------
 986390
(1 row)

Time: 23746.476 ms
phc=# select count(1) from c_pglog where command_tag = 'UPDATE';
 count 
--------
 986390
(1 row)

Time: 14059.405 ms


And even better if you can apply a relatively restrictive filter:

phc=# select count(1) from postgres_log where command_tag = 'UPDATE' and log_time BETWEEN '2014-04-16 07:15:00' and '2014-04-16 07:20:00';
 count
-------
 84982
(1 row)

Time: 19653.746 ms


phc=# select count(1) from c_pglog where command_tag = 'UPDATE' and log_time BETWEEN '2014-04-16 07:15:00' and '2014-04-16 07:20:00';
 count
-------
 84982
(1 row)

Time: 2260.891 ms


One limitation is that currently, with FDWs not able to cleanly push down aggregation to the foreign data wrapper, the actual aggregation is still done on the postgres side.  This means that large aggregates are about the same speed on cstore_fdw as they are for PostgreSQL tables:

phc=# select round((sum(duration)/1000)::numeric,2) from statements where command_tag = 'UPDATE';
 round 
--------
 444.94
(1 row)

Time: 2920.640 ms
phc=# select round((sum(duration)/1000)::numeric,2) from c_statements where command_tag = 'UPDATE';
 round 
--------
 444.94
(1 row)

Time: 3232.986 ms


The project plans to fix this, but until then, cstore_fdw is useful mainly for searches across really large/wide tables.  Or for seldom-touched archive tables where you want to save yourself GB or TB of disk space.

There are a bunch of other features, and a bunch of other limitations; tune in to the SFPUG event to learn more.

Joel Jacobson: psql \watch 1400000000 epoch time countdown counter

SET TIMEZONE TO 'UTC';
\t
\a
\pset fieldsep ' '
SELECT
    (('epoch'::timestamptz + 14*10^8 * '1 s'::interval)-now())::interval(0),
    (14*10^8-extract(epoch from now()))::int,
    extract(epoch from now())::int
;
\watch 1

09:18:28 33508 1399966492
09:18:27 33507 1399966493
09:18:26 33506 1399966494
09:18:25 33505 1399966495
09:18:24 33504 1399966496
09:18:23 33503 1399966497

Robert Haas: Troubleshooting Database Corruption

When your database gets corrupted, one of the most important things to do is figure out why that happened, so that you can try to ensure that it doesn't happen again.  After all, there's little point in going to a lot of trouble to restore a corrupt database from backup, or in attempting to repair the damage, if it's just going to get corrupted again.  However, there are times when root cause analysis must take a back seat to getting your database back on line.

Read more »

Tim van der Linden: PostgreSQL: A full text search engine - Part 3


And so we arrive at the last part of the series.

If you have not done so, please read part one and part two before embarking.

Today we will close up the introduction into PostgreSQL's full text capabilities by showing you a few aspects I have intentionally neglected in the previous parts. The most important ones being ranking and indexing.

So let us take off right away!

Ranking

Up until now you have seen what full text is, how to use it and how to do a full custom setup. What you have not yet seen is how to rank search results based on their relevance to the search query - a feature that most search engines offer and one that most users expect.

However, there is a problem when it comes to ranking: it is something that is somewhat undefined. It is a gray area left wide open to interpretation. It is almost...personal.

At its core, ranking within full text search means giving a document a place based on how many times certain words occur in the document, or on how close these words are to each other. So let us start there.

Normal ranking

The first case, ranking based on how many times certain words occur, has an accompanying function ready to be used: ts_rank(). It accepts a mandatory tsvector and a tsquery as its arguments and returns a float which represents how high the given document ranks. The function also accepts a weights array and a normalization integer, but that is for later down the road.

Let us test out the basic functionality:

SELECT ts_rank(to_tsvector('Elephants and dolphins do not live in the same habitat.'), to_tsquery('elephants'));

This is a regular old 'on the fly' query where we feed in a string which we convert to a tsvector and a token which is converted to a tsquery. The ranking result of this is:

0.0607927

This does not say much, does it? Okay, let us throw a few more tokens in the mix:

SELECT ts_rank(to_tsvector('Elephants and dolphins do not live in the same habitat.'), to_tsquery('elephants & dolphins'));

Now we want to query the two tokens elephants and dolphins. We chain them together in an AND (&) formation. The ranking:

0.0985009

Hmm, getting higher, good. More tokens please:

SELECT ts_rank(to_tsvector('Elephants and dolphins do not live in the same habitat.'), to_tsquery('elephants & dolphins & habitat & living'));

Results in:

0.414037

Oooh, that is quite nice. Notice the word living: the tsquery automatically stems it to match live, but that is, of course, all basic knowledge by now.

The idea here is simple, the more tokens match the string, the higher the ranking will be. You can use this float to later on sort your results.

Normal ranking with weights

Okay, let us spice things up a little bit, let us look at the weights array that could be set as an optional parameter.

Do you remember the weights we saw in chapter one? A quick rundown: You can optionally give weights to lexemes in a tsvector to group them together. This is, most of the time, used to reflect the original document structure within a tsvector. We also saw that, actually, all lexemes contain a standard weight of 'D' unless specified otherwise.

Weights, when ranking, define importance of words. The ts_rank() function will automatically take these weights into account and use a weights array to influence the ranking float. Remember that there are only four possible weights: A, B, C and D.

The weights array has a default value of:

{0.1, 0.2, 0.4, 1.0}

These values correspond to the weight letters you can assign. Note that these are in reverse order: the array represents {D,C,B,A}.

Let us test that out. We take the same query as before, but now using the setweight() function, we will apply a weight of C to all lexemes:

SELECT ts_rank(setweight(to_tsvector('Elephants and dolphins do not live in the same habitat.'), 'C'), to_tsquery('elephants & dolphins & habitat & live'));

The result:

0.674972

Wow, that is a lot higher than our last ranking (which had an implicit, default weight of D). The reason for this is that the floats in the weights array influence the ranking calculation. Just for fun, you can override the default weights array, simply by passing it in as a first argument. Let us put the weights all equal to the default of D being 0.1:

SELECT ts_rank(array[0.1,0.1,0.1,0.1], setweight(to_tsvector('Elephants and dolphins do not live in the same habitat.'), 'C'), to_tsquery('elephants & dolphins & habitat & live'));

And we get back:

0.414037

You can see that this is now back to the value we had before we assigned weights, or in other words, when the implicit weight was D. You can thus influence what kind of an effect a certain weight has on your ranking. You can even reverse the lot and make a D have a more positive influence than an A, just to mess with people's heads.

Normal ranking, the fair way

Not that what we have seen up until now was unfair, but it does not take into account the length of the documents searched through.

Document length is also an important factor when judging relevance. A short document which matches on four or five tokens has a different relevance than a document three times as long which matches on the same number of tokens. The shorter one is probably more relevant than the longer one.

The same ranking function ts_rank() has an extra, final optional parameter that you can pass in called the normalization integer. This integer can have a combination of seven different values, they can be a single integer or mixed with a pipe (|) to pass in multiple values.

The default value is 0 - meaning that it will ignore document length all together, giving us the more "unfair" behavior. The next values you can give are 1, 2, 4, 8, 16 and 32 which stand for the following manipulations of the ranking float:

  • 1: It will divide the ranking float by the sum of 1 and the logarithmic number of the document length. The latter number is the ratio this document has compared to the other documents you wish to compare.
  • 2: Simply divides the ranking float by the length of the document.
  • 4: Divides the ranking float by the harmonic mean (the fair average) between matched tokens. This one is only used by the other ranking function ts_rank_cd.
  • 8: Divides the ranking float by the number of unique words that are found in the document.
  • 16: Divides the ranking float by the sum of 1 and the logarithmic number of the number of unique words found in the document.
  • 32: Simply divides the ranking float by itself and adds one to that.

These are a lot of values and some of them are quite confusing. But all of these have only one purpose: to make ranking more "fair", based on various use cases.

Take, for example, 1 and 2. These calculate document length by taking into account the number of words present in the document. The words here refer to the number of pointers that are present in the tsvector.

To illustrate, we will convert the sentence "These tokens are repeating on purpose. Bad tokens!" into a tsvector, resulting in:

'bad':7 'purpos':6 'repeat':4 'token':2,8

The length of this document is 5, because we have five pointers in total.

If you now look at the integers 8 and 16, they use uniqueness to calculate document length. What that means is that they do not count the pointers, but the actual lexemes. For the above tsvector this would thus result in a length of 4.

All of these manipulations are just different ways of counting document length. The ones summed up in the above integer list are mere educated guesses at what most people desire when ranking with a full text engine. As I said in the beginning, it is a gray area, left open for interpretation.
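As a quick aside, the pipe syntax mentioned earlier is just a bitwise OR of these integers, so several normalizations can be applied at once; for example, dividing by document length (2) and by the number of unique words (8) in one go:

SELECT ts_rank(to_tsvector('Elephants and dolphins do not live in the same habitat.'), to_tsquery('elephants & dolphins'), 2 | 8);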

Let us try to see the different effects that such an integer can have.

First we need to create a few documents (tsvectors) inside our famous phraseTable (from the previous chapters) that we will use throughout this chapter. Connect to your phrase database, add a "title" column, truncate whatever we have stored there and insert a few variable length documents based on Edgar Allan Poe's "The Raven". I have prepared the whole syntax below, this time you may copy-and-paste:

ALTER TABLE phraseTable ADD COLUMN title VARCHAR;

TRUNCATE phraseTable;

INSERT into phraseTable VALUES (to_tsvector('Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore, While I nodded, nearly napping, suddenly there came a tapping, As of some one gently rapping, rapping at my chamber door. "Tis some visitor," I muttered, "tapping at my chamber door - Only this, and nothing more."'), 'Tiny Allan');

INSERT into phraseTable VALUES (to_tsvector('Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore, While I nodded, nearly napping, suddenly there came a tapping, As of some one gently rapping, rapping at my chamber door. "Tis some visitor," I muttered, "tapping at my chamber door - Only this, and nothing more." Ah, distinctly I remember it was in the bleak December, And each separate dying ember wrought its ghost upon the floor. Eagerly I wished the morrow - vainly I had sought to borrow From my books surcease of sorrow - sorrow for the lost Lenore - For the rare and radiant maiden whom the angels name Lenore - Nameless here for evermore.'), 'Small Allan');

INSERT into phraseTable VALUES (to_tsvector('Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore, While I nodded, nearly napping, suddenly there came a tapping, As of some one gently rapping, rapping at my chamber door. "Tis some visitor," I muttered, "tapping at my chamber door - Only this, and nothing more." Ah, distinctly I remember it was in the bleak December, And each separate dying ember wrought its ghost upon the floor. Eagerly I wished the morrow - vainly I had sought to borrow From my books surcease of sorrow - sorrow for the lost Lenore - For the rare and radiant maiden whom the angels name Lenore - Nameless here for evermore. And the silken sad uncertain rustling of each purple curtain Thrilled me - filled me with fantastic terrors never felt before; So that now, to still the beating of my heart, I stood repeating, "Tis some visitor entreating entrance at my chamber door - Some late visitor entreating entrance at my chamber door - This it is, and nothing more."'), 'Medium Allan');

INSERT into phraseTable VALUES (to_tsvector('Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore, While I nodded, nearly napping, suddenly there came a tapping, As of some one gently rapping, rapping at my chamber door. "Tis some visitor," I muttered, "tapping at my chamber door - Only this, and nothing more." Ah, distinctly I remember it was in the bleak December, And each separate dying ember wrought its ghost upon the floor. Eagerly I wished the morrow - vainly I had sought to borrow From my books surcease of sorrow - sorrow for the lost Lenore - For the rare and radiant maiden whom the angels name Lenore - Nameless here for evermore. And the silken sad uncertain rustling of each purple curtain Thrilled me - filled me with fantastic terrors never felt before; So that now, to still the beating of my heart, I stood repeating, "Tis some visitor entreating entrance at my chamber door - Some late visitor entreating entrance at my chamber door - This it is, and nothing more."
Presently my soul grew stronger; hesitating then no longer, "Sir," said I, "or Madam, truly your forgiveness I implore; But the fact is I was napping, and so gently you came rapping, And so faintly you came tapping, tapping at my chamber door, That I scarce was sure I heard you"- here I opened wide the door; - Darkness there, and nothing more.'), 'Big Allan');

Nothing better than some good old Edgar to demonstrate full text search ranking. Here we have four different lengths of the same verse, making for four documents of different lengths stored in our tsvector column. Now we would like to search through these documents and find the keywords 'door' and 'gently', ranking them as we go.

For later reference, let us first count how many times our keywords occur in the sentence:

  • Tiny Allan: "door" 2, "gently" 1
  • Small Allan: "door" 2, "gently" 1
  • Medium Allan: "door" 4, "gently" 1
  • Big Allan: "door" 6, "gently" 2

First, let us simply rank the result with the default normalization of 0:

SELECT title, ts_rank(phrase, keywords) AS rank FROM phraseTable, to_tsquery('door & gently') keywords WHERE keywords @@ phrase;

Before we go over the results, a little bit about this query for people who are not so familiar with this SQL syntax. We do a simple SELECT from a data set using FROM filtering it with a WHERE clause. Going over it line by line:

SELECT title, ts_rank(phrase, keywords) AS rank

We SELECT on the title column we just made and on an "on-the-fly" column we create for the result set, named rank, which contains the result of the ts_rank() function.

FROM phraseTable, to_tsquery('door & gently') keywords

In the FROM clause you can put a series of statements that will deliver the data for the query. In this case we take our normal database table and the result of the to_tsquery() function which we name keywords so we can use it throughout the query itself.

WHERE keywords @@ phrase;

Here we filter the result set using the WHERE clause and the matching operator (@@). The @@ is a Boolean operator, meaning it will simply return true or false. So in this case, we check if the result of the to_tsquery() function (named keywords, which will return lexemes) matches the phrase column from our table (which contains tsvectors and thus lexemes). We want to rank only those phrases that actually contain our keywords.

Now, back to our ranking. The result of this query will be:

   title     |   rank    
--------------+-----------
Tiny Allan   | 0.0906565
Small Allan  | 0.0906565
Medium Allan | 0.0906565
Big Allan    |   0.10109

Let us order the results first, so the most relevant document is always on top:

SELECT title, ts_rank(phrase, keywords) AS rank FROM phraseTable, to_tsquery('door & gently') keywords WHERE keywords @@ phrase ORDER BY rank DESC;

Result:

   title     |   rank    
--------------+-----------
Big Allan    |   0.10109
Tiny Allan   | 0.0906565
Small Allan  | 0.0906565
Medium Allan | 0.0906565

"Big Allen" is on top, for it has more occurrences of the keywords "door" and "gently". But to be fair, in ratio "Tiny Allan" has almost the same amount of occurrences of both keywords. Three times less, but it also is three times as small.

So let us take document length (based on word count) into account, setting our normalization to 1:

SELECT title, ts_rank(phrase, keywords, 1) AS rank FROM phraseTable, to_tsquery('door & gently') keywords WHERE keywords @@ phrase ORDER BY rank DESC;

You will get:

   title     |   rank    
--------------+-----------
Tiny Allan   | 0.0181313
Small Allan  | 0.0151094
Big Allan    | 0.0145124
Medium Allan |  0.013831

This could be seen as a fairer ranking: "Tiny Allan" is now on top because, considering its ratio, it is the most relevant. "Medium Allan" falls all the way down because it is almost as big as "Big Allan", but contains fewer occurrences of the keywords: five keywords in total, in contrast to "Big Allan" which has eight but is only slightly bigger.

Let us do the same, but count the document length based on the unique occurrences, using integer 8:

SELECT title, ts_rank(phrase, keywords, 8) AS rank FROM phraseTable, to_tsquery('door & gently') keywords WHERE keywords @@ phrase ORDER BY rank DESC;

The result:

Tiny Allan   | 0.00335765
Small Allan  | 0.00161887
Medium Allan | 0.00119285
Big Allan    | 0.00105303

That is a very different result, but quite what you should expect.

We are searching for only two tokens here, and considering the fact that uniqueness is now taken into account, all the extra occurrences of these words are ignored. This means that for the ranking algorithm, all the documents we searched through (which all have at least one occurrence of each token) get normalized to only 2 matching tokens. And in that case, the shortest document wins hands down, for it is seen as the most relevant. As you can see in the result set, the documents are neatly ordered from tiny to big.

Ranking with density

Up until now we have seen the "normal" ranking function ts_rank(), which is the one you will probably use the most.

There is, however, one more function at our direct disposal, called ts_rank_cd(). The cd stands for Cover Density, which is simply yet another way of considering relevance. This function has exactly the same required and optional arguments; it simply counts relevancy differently. Very important for this function to work properly is that you do not let it operate on a stripped tsvector.

A stripped tsvector is one that has had its pointer information removed. If you know that you do not need this pointer information - you just need to match tsqueries against the lexemes in your tsvector - you can strip these pointers and thus make for a smaller footprint in your database.
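If that smaller footprint is what you are after, PostgreSQL ships a strip() function which does exactly that; a quick illustration:

SELECT strip(to_tsvector('Elephants and dolphins do not live in the same habitat.'));

-- roughly: 'dolphin' 'eleph' 'habitat' 'live'   (the positions are gone)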

In the case of our cover density ranker, it needs this positional pointer information to see how close the search tokens are to each other. It makes sense that this ranking function only works on multiple tokens; on single tokens it is kind of pointless.

In a way, this ranking function looks for phrases rather than single tokens; the closer lexemes are together, the more positive influence they will have on the resulting ranking float.

In our "Raven" examples this might be a little bit hard to see, so let me demonstrate this with a couple of new, on-the-fly queries.

We wish to search for the tokens 'token' and 'count'.

First, a sentence in which the searched-for tokens are wide apart: "These tokens are very wide apart and do not count as much.":

SELECT ts_rank(phrase, keywords, 8) AS rank FROM to_tsvector('These tokens are very wide apart and do not count as much.') phrase, to_tsquery('token & count') keywords WHERE keywords @@ phrase ORDER BY rank DESC;

Will have this tsvector:

'apart':6 'count':10 'much':12 'token':2 'wide':5

And this result:

 0.008624

Let us put these tokens closer together now: "These tokens count for much now that they are not so wide apart!":

SELECT ts_rank(phrase, keywords, 8) AS rank FROM to_tsvector('These tokens count for much now that they are not so wide apart!') phrase, to_tsquery('token & count') keywords WHERE keywords @@ phrase ORDER BY rank DESC;

The vector:

'apart':13 'count':3 'much':5 'token':2 'wide':12

The result:

0.0198206

You can see that both vectors have exactly the same lexemes, but different pointer information. In the second vector, the tokens we searched for are next to each other, which results in a ranking float that is more than double the first result.

This demonstrates the working of this function. The same optional manipulations can be passed in (weights and normalization) and they will have roughly the same effect.

Pick the ranking function that is best fit for your use case.

It needs to be said that the two ranking functions we have seen so far are officially called example functions by the PostgreSQL community. They are functions devised to be fitting for most purposes, but also to demonstrate how you could write your own.

If you have very specific use cases it is advised to write your own ranking functions to fit your exact needs. But that is considered beyond the scope of this series (and maybe also beyond the scope of your needs).

Highlight your results!

The next interesting thing we can do with the results of our full text search is to highlight the relevant words.

As is the case with many search engines, users want to skim over an excerpt of each result to see if it is what they are searching for. For this PostgreSQL delivers us yet another function: ts_headline().

To demonstrate its use, we first have to make our small database a little bit bigger by inserting the original text of the "Raven" next to our tsvectors. So, again, copy and paste this new set of queries (yes, you may...):

TRUNCATE phraseTable;

ALTER TABLE phraseTable ADD COLUMN article TEXT;

INSERT into phraseTable VALUES (to_tsvector('Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore, While I nodded, nearly napping, suddenly there came a tapping, As of some one gently rapping, rapping at my chamber door. "Tis some visitor," I muttered, "tapping at my chamber door - Only this, and nothing more."'), 'Tiny Allan', 'Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore, While I nodded, nearly napping, suddenly there came a tapping, As of some one gently rapping, rapping at my chamber door. "Tis some visitor," I muttered, "tapping at my chamber door - Only this, and nothing more."');

INSERT into phraseTable VALUES (to_tsvector('Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore, While I nodded, nearly napping, suddenly there came a tapping, As of some one gently rapping, rapping at my chamber door. "Tis some visitor," I muttered, "tapping at my chamber door - Only this, and nothing more." Ah, distinctly I remember it was in the bleak December, And each separate dying ember wrought its ghost upon the floor. Eagerly I wished the morrow - vainly I had sought to borrow From my books surcease of sorrow - sorrow for the lost Lenore - For the rare and radiant maiden whom the angels name Lenore - Nameless here for evermore.'), 'Small Allan', 'Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore, While I nodded, nearly napping, suddenly there came a tapping, As of some one gently rapping, rapping at my chamber door. "Tis some visitor," I muttered, "tapping at my chamber door - Only this, and nothing more." Ah, distinctly I remember it was in the bleak December, And each separate dying ember wrought its ghost upon the floor. Eagerly I wished the morrow - vainly I had sought to borrow From my books surcease of sorrow - sorrow for the lost Lenore - For the rare and radiant maiden whom the angels name Lenore - Nameless here for evermore.');

INSERT into phraseTable VALUES (to_tsvector('Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore, While I nodded, nearly napping, suddenly there came a tapping, As of some one gently rapping, rapping at my chamber door. "Tis some visitor," I muttered, "tapping at my chamber door - Only this, and nothing more." Ah, distinctly I remember it was in the bleak December, And each separate dying ember wrought its ghost upon the floor. Eagerly I wished the morrow - vainly I had sought to borrow From my books surcease of sorrow - sorrow for the lost Lenore - For the rare and radiant maiden whom the angels name Lenore - Nameless here for evermore. And the silken sad uncertain rustling of each purple curtain Thrilled me - filled me with fantastic terrors never felt before; So that now, to still the beating of my heart, I stood repeating, "Tis some visitor entreating entrance at my chamber door - Some late visitor entreating entrance at my chamber door - This it is, and nothing more."'), 'Medium Allan', 'Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore, While I nodded, nearly napping, suddenly there came a tapping, As of some one gently rapping, rapping at my chamber door. 
"Tis some visitor," I muttered, "tapping at my chamber door - Only this, and nothing more." Ah, distinctly I remember it was in the bleak December, And each separate dying ember wrought its ghost upon the floor. Eagerly I wished the morrow - vainly I had sought to borrow From my books surcease of sorrow - sorrow for the lost Lenore - For the rare and radiant maiden whom the angels name Lenore - Nameless here for evermore. And the silken sad uncertain rustling of each purple curtain Thrilled me - filled me with fantastic terrors never felt before; So that now, to still the beating of my heart, I stood repeating, "Tis some visitor entreating entrance at my chamber door - Some late visitor entreating entrance at my chamber door - This it is, and nothing more');

INSERT into phraseTable VALUES (to_tsvector('Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore, While I nodded, nearly napping, suddenly there came a tapping, As of some one gently rapping, rapping at my chamber door. "Tis some visitor," I muttered, "tapping at my chamber door - Only this, and nothing more." Ah, distinctly I remember it was in the bleak December, And each separate dying ember wrought its ghost upon the floor. Eagerly I wished the morrow - vainly I had sought to borrow From my books surcease of sorrow - sorrow for the lost Lenore - For the rare and radiant maiden whom the angels name Lenore - Nameless here for evermore. And the silken sad uncertain rustling of each purple curtain Thrilled me - filled me with fantastic terrors never felt before; So that now, to still the beating of my heart, I stood repeating, "Tis some visitor entreating entrance at my chamber door - Some late visitor entreating entrance at my chamber door - This it is, and nothing more."  Presently my soul grew stronger; hesitating then no longer, "Sir," said I, "or Madam, truly your forgiveness I implore; But the fact is I was napping, and so gently you came rapping, And so faintly you came tapping, tapping at my chamber door, That I scarce was sure I heard you"- here I opened wide the door; - Darkness there, and nothing more.'), 'Big Allan', 'Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore, While I nodded, nearly napping, suddenly there came a tapping, As of some one gently rapping, rapping at my chamber door. "Tis some visitor," I muttered, "tapping at my chamber door - Only this, and nothing more." Ah, distinctly I remember it was in the bleak December, And each separate dying ember wrought its ghost upon the floor. Eagerly I wished the morrow - vainly I had sought to borrow From my books surcease of sorrow - sorrow for the lost Lenore - For the rare and radiant maiden whom the angels name Lenore - Nameless here for evermore. And the silken sad uncertain rustling of each purple curtain Thrilled me - filled me with fantastic terrors never felt before; So that now, to still the beating of my heart, I stood repeating, "Tis some visitor entreating entrance at my chamber door - Some late visitor entreating entrance at my chamber door - This it is, and nothing more."  Presently my soul grew stronger; hesitating then no longer, "Sir," said I, "or Madam, truly your forgiveness I implore; But the fact is I was napping, and so gently you came rapping, And so faintly you came tapping, tapping at my chamber door, That I scarce was sure I heard you"- here I opened wide the door; - Darkness there, and nothing more.');

Good, we now have the same data, but this time we stored the text of the original document alongside the "vectorized" version.

The reason for this is that the ts_headline() function searches in the original documents (our article column) rather than in your tsvector column. Two arguments are mandatory: the original article and the tsquery. The optional arguments are the full text configuration you wish to use and a string of additional, comma-separated options.

But first, let us take a look at its most basic usage:

SELECT title, ts_headline(article, keywords) as result FROM phraseTable, to_tsquery('door & gently') as keywords WHERE phrase @@ keywords;

Will give you:

   title     |                                                      result                                                      
--------------+------------------------------------------------------------------------------------------------------------------
Tiny Allan   | <b>gently</b> rapping, rapping at my chamber <b>door</b>. "Tis some visitor," I muttered, "tapping at my chamber
Small Allan  | <b>gently</b> rapping, rapping at my chamber <b>door</b>. "Tis some visitor," I muttered, "tapping at my chamber
Medium Allan | <b>gently</b> rapping, rapping at my chamber <b>door</b>. "Tis some visitor," I muttered, "tapping at my chamber
Big Allan    | <b>gently</b> rapping, rapping at my chamber <b>door</b>. "Tis some visitor," I muttered, "tapping at my chamber

As you can see, we get back a short excerpt of each verse with the tokens of interest surrounded by an HTML "<b>" tag. That is actually all there is to this function: it returns the results with the tokens highlighted.

However, there are some nice options you can set to alter this basic behavior.

The first one up is the HTML tag you wish to put around your highlighted words. For this you have two variables, StartSel and StopSel. If we wanted this to be an "<em>" tag instead, we could tell the function to change as follows:

SELECT title, ts_headline(article, keywords, 'StartSel=<em>,StopSel=</em>') as result FROM phraseTable, to_tsquery('door & gently') as keywords WHERE phrase @@ keywords;

And now we will get back an <em> instead of a <b> (including just one row this time):

   title     |                                                        result                                                        
--------------+----------------------------------------------------------------------------------------------------------------------
Tiny Allan   | <em>gently</em> rapping, rapping at my chamber <em>door</em>. "Tis some visitor," I muttered, "tapping at my chamber

In fact, it does not need to be HTML at all, you can put (almost) any string there:

SELECT title, ts_headline(article, keywords, 'StartSel=foobar>>,StopSel=<<barfoo ') as result FROM phraseTable, to_tsquery('door & gently') as keywords WHERE phrase @@ keywords;

Result:

   title     |                                                               result                                                               
--------------+------------------------------------------------------------------------------------------------------------------------------------
Tiny Allan   | foobar>>gently<<barfoo rapping, rapping at my chamber foobar>>door<<barfoo. "Tis some visitor," I muttered, "tapping at my chamber

Quite awesome!

Another attribute you can tamper with is how many words should be included in the result set, by using the MaxWords and MinWords options:

SELECT title, ts_headline(article, keywords, 'MaxWords=4,MinWords=1') as result FROM phraseTable, to_tsquery('door & gently') as keywords WHERE phrase @@ keywords;

Which gives you:

   title     |             result             
--------------+--------------------------------
Tiny Allan   | <b>gently</b> rapping, rapping

To make the resulting headline a little bit more readable there is an attribute in this options string called ShortWord which tells the function which is the shortest word that may appear at the start or end of the headline.

SELECT title, ts_headline(article, keywords, 'ShortWord=8') as result FROM phraseTable, to_tsquery('door & gently') as keywords WHERE phrase @@ keywords;

Will give you:

   title     |                                                                                   result                                                                                    
--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Tiny Allan   | <b>gently</b> rapping, rapping at my chamber <b>door</b>. "Tis some visitor," I muttered, "tapping at my chamber <b>door</b> - Only this, and nothing more."
Small Allan  | <b>gently</b> rapping, rapping at my chamber <b>door</b>. "Tis some visitor," I muttered, "tapping at my chamber <b>door</b> - Only this, and nothing more." Ah, distinctly

Now it will try to set the word boundaries at words of at least 8 letters. This time I included the second line of the result set. As you can see, the engine could not find an 8 letter word in the remainder of the first document, so it simply prints it until the end. The second row, "Small Allan", is a bit bigger and the word "distinctly" has more than 8 letters, so it is set as the boundary.

So far the headline function has given us almost full sentences and not really fragments of text. This is because the optional MaxFragments defaults to 0. If we up this variable, it will start to include fragments and not sentences. Let us try it out:

SELECT title, ts_headline(article, keywords, 'MaxFragments=2,MaxWords=8,MinWords=1') AS result
FROM phraseTable, to_tsquery('door & gently') AS keywords
WHERE phrase @@ keywords;

Gives you:

   title     |                                                     result                                                      
--------------+-----------------------------------------------------------------------------------------------------------------
Tiny Allan   | <b>gently</b> rapping, rapping at my chamber <b>door</b>
...
Big Allan    | <b>gently</b> rapping, rapping at my chamber <b>door</b> ... chamber <b>door</b> - This it is, and nothing more

I include only the first and last line of this result set. As you can see on the last line, the result is now fragmented, and we get back different pieces of our result. If, for instance, four or five tokens match in our document, setting the MaxFragments to a higher value will show more of these matches glued together.

Accompanying this MaxFragments option is the FragmentDelimiter variable which is used to define, well, the delimiter between the fragments. Short demo:

SELECT title, ts_headline(article, keywords, 'MaxFragments=2,FragmentDelimiter=;,MaxWords=8,MinWords=1') AS result
FROM phraseTable, to_tsquery('door & gently') AS keywords
WHERE phrase @@ keywords;

You will get:

   title     |                                                   result                                                    
--------------+-------------------------------------------------------------------------------------------------------------
Big Allan    | <b>gently</b> rapping, rapping at my chamber <b>door</b>;chamber <b>door</b> - This it is, and nothing more

Including only the last line, you will see we now have a semicolon (;) instead of an ellipsis (...). Neat.

A final, less common option for the ts_headline() function is to ignore all the word boundaries we set before and simply return the whole document and highlight all the words of relevance. This variable is called HighlightAll and is a Boolean set to false by default:

SELECT title, ts_headline(article, keywords, 'HighlightAll=true') AS result
FROM phraseTable, to_tsquery('door & gently') AS keywords
WHERE phrase @@ keywords;

The result would be too large to print here, but try it out. It will give you the whole text, but with the important tokens decorated with the element (or text) of choice.

A big word of caution

It is very fun to play with highlighting your results, I will admit that. The only problem is, as you might have concluded yourself, this is a potential performance grinder.

The problem here is that this function cannot use any indexes, nor can it use your stored tsvector. It needs the original document text: not only does it have to parse the whole document into a tsvector for matching, it also has to go over the original text a second time to find the substrings and decorate them with the characters you have set. And this whole process has to happen for every single record in your result set.

Highlighting, with this function, is a very expensive thing to do.

This does not mean that you have to avoid this function; if that were the case I would have told you from the start and skipped this whole part. No, it is there to be used. But use it in a correct way.

A correct way often seen is to use the highlighting only on the top results you are interested in - the top results the user has on their screen at the moment. This could be achieved in SQL with a so-called subquery.

SELECT title, ts_headline(article, keywords) AS result, rank
FROM (
    SELECT keywords, article, title, phrase, ts_rank(phrase, keywords) AS rank
    FROM phraseTable, to_tsquery('door & gently') AS keywords
    WHERE phrase @@ keywords
    ORDER BY rank
    LIMIT 2
) AS alias;

For those unfamiliar, a subquery is nothing more than a query within a query (cue Inception drums...sorry).

You evaluate the inner query and use the result set of that to perform the outer query. You can achieve the same with two queries, but that would prove not to be as elegant. When PostgreSQL sees a subquery, it can plan and execute more efficiently than with separate queries, often giving you better performance.

The query you see above might look a bit frightening to beginning SQL folk, but simply see it as two separate ones and the beast becomes a tiny mouse. Unless you are afraid of mice, let it become a...euhm...soft butterfly gliding on the wind instead.

In the inner query we perform the actual matching and ranking as we have seen before. This inner query then only returns two matching records, because of the LIMIT clause. The outer query takes those results and performs the expensive operation of highlighting.

Indexing

Back to a more serious matter, the act of indexing.

If you do not know what an index is, you have to brush up real fast, for indexing is quite important for the performance of your queries. In a very simplistic view, an index is like a chapter listing in a book. You can quickly skim over the chapters to find the page you are looking for, instead of having to flip over every single page.

You typically put indexes on tables which are consulted often and you build the index in a way that is in parallel with how you query them.

As indexing is a whole topic, or rather, a whole profession of its own, I will not go too deeply into the matter. But I will try to give you some basic knowledge on the subject.

Note that I will go over this matter at lightning speed and thus have to greatly skim down on the amount of detail. A very good place to learn about indexes is Markus Winand's Use The Index, Luke series. I seriously suggest you read that stuff, it is golden knowledge for every serious developer working with databases.

B-tree

Before we can go to the PostgreSQL full text index types we first have to look at the most common index type, the Binary tree or B-tree.

The B-tree is a proven "computer science" concept that gives us a way to search certain types of data, fast.

A B-tree is a tree structure with a root, nodes and leaves (inverted compared to a natural tree). The data that is within your table rows will be ordered and chopped up to fit within the tree.

In database indexes we mostly use balanced B-trees, meaning that all leaves sit at the same depth of the tree.

Take this picture for example:

            |root|
              |
       ----------------
       |               |
    |node|          |node|
       |               |
  ----------       --------- 
  |        |       |       |
|node|  |node|  |node|  |node|

In B-tree terms, we summarize this tree by saying:

  • It has an order of 2, meaning that each node holds at most two child nodes
  • It has a depth of 3, meaning it is three levels deep (including the root node)
  • It has 4 leaves, meaning that the number of nodes that do not contain children is 4 (the bottom row)

If you set the order of your tree to a higher number, more children fit under a single node and you will end up with a smaller depth.

Now the actual power of an index comes from an I/O perspective. As you know (or will know now) the thing that will slow down a program/computer the most is I/O. This can be network I/O, disk I/O, etc. In case of our database we will speak of disk I/O.

When a database has to scan your table without an index, it has to plow through all your rows to find a match. Database rows are almost always not I/O optimized, meaning that they do not fit well in the blocks of your physical disk's structure. This, in short, means that there is a lot of overhead in reading through that physical data.

A B-tree on the other hand, is very optimized for I/O. Each level of a B-tree will try and fit perfectly within one physical block on your disk. If all levels fit within one block each, walking over the tree will be very efficient and have almost no overhead.

B-trees work with the most common data types such as TEXT, INT, VARCHAR, ... .
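As a small illustration, and assuming we simply wanted fast equality lookups on the title column of our phraseTable, a plain B-tree index would look like this (the index name is just an example):

-- Example name; B-tree is the default index type, so USING btree can be omitted.
CREATE INDEX phrasetable_title_idx ON phraseTable (title);

-- A query like this can now walk the tree instead of scanning every row:
SELECT title, article FROM phraseTable WHERE title = 'Tiny Allan';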

But because full text search in PostgreSQL is its own "thing" (using the @@ operator), most of what you may have learned about indexes does not apply (or not in full, anyway) to full text search.

Full text search needs its own kind of indexing for a tsquery to be able to use them. And as we will see in a moment, indexing on full text in PostgreSQL is a dance of trade-offs. When it comes to this matter we have two types of indexes available: GiST and GiN which are both closely related to the B-tree.

GiST

GiST stands for Generalized Search Tree and can be set on both tsvector and tsquery column types, though most of the time you will use it on the former.

The GiST itself is not something that is unique to PostgreSQL; it is a project of its own and its concept is laid out in a C library called libGist. You could go ahead and play around with libGist to get a better understanding of how it works, it even comes shipped with some demo applications.

Over time many new types of trees based on the B-tree concept have appeared, but most of them are limited in how they can match. A B-tree and its direct descendants can only use basic match operators like "<", ">", "=", etc. A GiST index, however, has more advanced matching capabilities like "intersect" and, in case of PostgreSQL's implementation, the "@@" operator.

Another big advantage of GiST is the fact that it can store arbitrary data types and can therefore be used in a wide range of use cases. The trade-off for the wide data type support is the fact that GiST will always return a no if there is no match or a maybe if there is. There is no true hit with this kind of index.

Because of this behavior there is extra overhead in the case of full text search, because PostgreSQL has to manually go and check all the maybes that are returned and see if they are actual matches.

The big advantages of GiST are the fact that the index builds faster and that updating such an index is less expensive than with the next index type we will see.

GiN

The second index candidate we have at our disposal is the Generalized Inverted Index, or GiN for short.

Same as we saw with GiST, GiN also allows arbitrary data types to be indexed and allows for more matching operators to be used. But as opposed to GiST, a GiN index is deterministic: it will always return a true match, cutting the checking overhead needed with GiST.

Well, unless you wish to use weights in your queries. A GiN index does not store lexeme weights. This means that, if weights need to be taken into account when querying, PostgreSQL still has to go and fetch the actual row(s) that return a true match, giving you somewhat of the same overhead as with a GiST index.

GiN tries to improve on the B-tree concept by minimizing the amount of redundancy within nodes and their leaves. In an ordinary index, a key that occurs in many rows ends up being referenced over and over again. In a GiN index this duplication is avoided by storing each key (each lexeme, in our case) only once, together with a so-called posting list or posting tree that points to all rows containing that key, instead of drilling down the tree separately for every duplicate.

The downside of GiN is the fact that this kind of index will slow down the bigger it gets.

On a more positive note, GiN indexes are most of the time smaller on disk (because they try to reduce duplicates). And, as of PostgreSQL 9.4, they will be even smaller. The soon-to-be-released version introduces so-called varbyte compression for GiN. For now just take it from me that it will make these types of indexes much smaller, and even more efficient.

As you can see, there is no perfect index when it comes to full text. You will have to carefully look at what data you will save and how you wish to query the data.

If you do not update your database much but you have a lot of querying going on, GiN might be a better option for it is much faster with a lookup (if no weights are required). If your data does not get read much, but is updated frequently, maybe a GiST is a better choice for it allows for faster updating.

Making an index

We have (very roughly) seen what an index is and what we have available for full text, but how do you actually build such an index?

Luckily for us, this too has been neatly abstracted and is very simple to do.

If we wanted our phraseTable to contain an index, we simply could go about and create it with the following syntax:

CREATE INDEX phrasetable_idx ON phraseTable USING gin(phrase);

This will create a GiN index called phrasetable_idx on the column phrase.

Just like we did before, we will now re-populate our phrase column, but this time we will fill it with the data we want to have indexed: article and title. Let me show you what I mean.

First, empty the four phrase columns in our tiny database:

ALTER TABLE phraseTable ALTER phrase DROP NOT NULL;
UPDATE phraseTable SET phrase = NULL;

Notice that I removed the NOT NULL constraint. Next we can populate it containing a tsvector of both the title and the article columns:

UPDATE phraseTable SET phrase = to_tsvector('english', coalesce(title,'') || ' ' || coalesce(article,''));

The coalesce function may be something that you are unfamiliar with. This function simply returns the first of its arguments that is not NULL. In this case we use:

coalesce(title,'')

Which means that if title is NULL it will return the empty string '', which is never NULL. We use coalesce here to substitute a value, the empty string, for NULL.

If we did not substitute for NULL, the concatenation (and thus the resulting tsvector) would itself become NULL whenever either the title or article column is NULL.
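If you want to see coalesce in action, a quick, self-contained check is:

SELECT coalesce(NULL, 'fallback') AS with_null,
       coalesce('foo', 'fallback') AS without_null;

The first column comes back as 'fallback', the second as 'foo'.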

Next we can create an index on that newly filled column:

CREATE INDEX phrasetable_idx ON phraseTable USING gin(phrase);

And we have magic: there now is a GiN index on that column which will be used during full text searches.

To create a GiST index we could use exactly the same syntax:

CREATE INDEX phrasetable_idx ON phraseTable USING gist(phrase);
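Whichever of the two you pick, you can check that it pays off. A small sketch, assuming the phrasetable_idx index we just created: EXPLAIN shows whether the planner uses the index (on a table as tiny as ours it may still prefer a sequential scan), and pg_size_pretty() with pg_relation_size() shows how much disk space the index takes up:

EXPLAIN
SELECT title
FROM phraseTable, to_tsquery('door & gently') AS keywords
WHERE phrase @@ keywords;

SELECT pg_size_pretty(pg_relation_size('phrasetable_idx'));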

Now, the disk-space savvy readers will have noticed that our "phrase" column now contains some redundant information, as we store the tsvector of the article and title columns that are already in the database. If you do not wish to have this extra column, you could create expression indexes (the on-the-fly approach we have seen before).

The setup of such an expression index is trivial:

CREATE INDEX phrasetable_exp_idx ON phraseTable USING gin(to_tsvector('english', coalesce(title,'') || ' ' || coalesce(article,'')));

Instead of having this extra tsvector column around, we now have created an on-the-fly index using the same syntax as we employed when we populated the phrase column a few lines back.

One important thing to note when you use expression indexes is the text search configuration you used. Here we specify that we wish the index to be created using the 'english' configuration set. This results in an index which is configuration aware and will only work with a query which has the same configuration set fed to the tsquery function (well, the same name anyway).
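To illustrate, here is a sketch of a query that can take advantage of the expression index above. Note that the WHERE clause repeats the exact same expression (and the same configuration name) that we used when creating the index, which is what allows the planner to match it to the index:

SELECT title
FROM phraseTable, to_tsquery('english', 'door & gently') AS keywords
WHERE to_tsvector('english', coalesce(title,'') || ' ' || coalesce(article,'')) @@ keywords;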

You could omit the configuration, which would then default to the one set in the "default_text_search_config" variable we saw in the last chapter. The problem you have then is that the index is created using a configuration that could be altered after the index was created. If we later query the database with the altered default, the index would be useless and could return inaccurate results.

Also note that we may save on disk space when we use the expression index, but we do not save on CPU. Now, instead of reading a tsvector that is already parsed and ready in a column, the to_tsvector() call has to be computed on the fly, on every insert and update, and whenever a candidate match has to be rechecked against the original expression. Again, a world of trade-offs.

Triggers

A final, small topic I want to briefly touch on before I let you go free is update triggers. The way we have been populating our database so far does not actually need a trigger. Up until now we have been inserting records (or updating them) using the to_tsvector() function. The negative aspect of going about it the way we did is that it is extremely redundant.

If we inserted a piece of Raven text into a record, we specified it twice, one time for the article column and one time for the phrase column which holds the tsvector result.

A better way to do this is to not let the insert query care about the tsvector at all. We simply insert the text we like and let the database do the converting behind the curtains.

This is where a trigger comes in handy. Actually, PostgreSQL has a whole set of trigger functions available that will fire when certain conditions are met, but when it comes to full text we have two functions at our disposal.

tsvector_update_trigger()

The first, and most used one, is called tsvector_update_trigger() and fires whenever a row is inserted or updated in your table (in our case phraseTable).

To setup such a trigger, we could use the following SQL:

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON phraseTable FOR EACH ROW
EXECUTE PROCEDURE tsvector_update_trigger(phrase, 'pg_catalog.english', title, article);

That is all you need to setup such a trigger. Let us see what we just did.

First, we have new syntax staring us in the face: CREATE TRIGGER. This will create a trigger that fires on certain events. The events here are BEFORE INSERT and BEFORE UPDATE, which are contracted to BEFORE INSERT OR UPDATE. Then we specify on which table this trigger has to act and that it should fire FOR EACH ROW. Finally we say we want to EXECUTE PROCEDURE, which, in our case, is the function tsvector_update_trigger().

The function itself needs a bit of explaining as well. This version takes three required arguments: the tsvector column name, the full text configuration name and the original text column name. The latter can be multiple columns, which will be concatenated together. This concatenation handles NULL values for you under the hood, much like the coalesce trick we used before.

In our case, we create a trigger that takes the phrase tsvector column, the english full text configuration and concatenates the text from both title and article to be normalized into lexemes.

Note that instead of english we say pg_catalog.english when providing this function with the full text configuration. In case of this function (and the next) we have to provide the schema-qualified path to the configuration.
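With the trigger in place you can test it by inserting a row without touching the phrase column at all. A small sketch, assuming title and article are the only columns that need a value (the values below are just made-up test data):

INSERT INTO phraseTable (title, article)
VALUES ('Trigger Allan', 'Deep into that darkness peering, long I stood there wondering, fearing.');

-- The trigger should have filled the phrase column for us.
SELECT phrase FROM phraseTable WHERE title = 'Trigger Allan';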

tsvector_update_trigger_column()

The other of the two full text trigger functions we have is called tsvector_update_trigger_column() and has only one difference to the former: the full text configuration used. Here, the full text configuration can be read from a column instead of given directly as a string.

A possibility we have not seen in this series is one where you can have yet another column in your phraseTable where you store the name of the full text configuration you wish to use. This way you can store multiple "languages" within the same table, specifying which configuration to use with each row.

This trigger function can take these per-row configurations into account and is able to read them from the specified column.
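A minimal sketch of such a setup, assuming we add the configuration column ourselves (the column name "language" is made up, and it has to be of type regconfig):

-- Hypothetical extra column holding the configuration to use per row.
ALTER TABLE phraseTable ADD COLUMN language regconfig DEFAULT 'pg_catalog.english';

CREATE TRIGGER tsvectorupdatecolumn BEFORE INSERT OR UPDATE
ON phraseTable FOR EACH ROW
EXECUTE PROCEDURE tsvector_update_trigger_column(phrase, language, title, article);

Each row can now carry its own configuration, and the trigger will pick it up automatically.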

But we have a trade-off once more. These two trigger functions, which are officially called example functions again (remember our ranking functions?), do not take weights into account. If you have the need to store different weights in your tsvectors, you will have to write your own trigger function.
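If you do need weights, writing your own trigger function is not much work. Below is a minimal sketch in PL/pgSQL; the function and trigger names are made up, and the choice of weight 'A' for the title and 'B' for the article is just an example (you would use this instead of the tsvectorupdate trigger from before):

CREATE FUNCTION phrase_weighted_trigger() RETURNS trigger AS $$
BEGIN
  -- Build the tsvector by hand so we can attach a weight per column.
  NEW.phrase :=
      setweight(to_tsvector('pg_catalog.english', coalesce(NEW.title, '')), 'A') ||
      setweight(to_tsvector('pg_catalog.english', coalesce(NEW.article, '')), 'B');
  RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER tsvectorweightedupdate BEFORE INSERT OR UPDATE
ON phraseTable FOR EACH ROW EXECUTE PROCEDURE phrase_weighted_trigger();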

The end

Okay, I guess this covers the basics of full text within PostgreSQL.

We have covered the most important parts and touched some segments deeply, others just with a soft lover's glove. As I always say at the end of such lengthy chapters: go out and explore.

I have tried to give you a solid, full text knowledge base to build further adventures on. I highly encourage you to pack your elephant, take your new ship for a maiden voyage, set high the sails and if certain blue wales try to swim next to your vessel, simply let the mammoth take a good relief down the ship's head, and let those turds float together with our squeaky finned friends!

And as always...thanks for reading!

Michael Paquier: Postgres 9.4 feature highlight: MSVC installer for client binaries and libraries


Today here is a highlight of a new Postgres 9.4 feature interesting for developers and companies doing packaging of Postgres on Windows as it makes possible the installation of client-only binaries and libraries using MSVC. It has been introduced by this commit:

commit a7e5f7bf6890fdf14a6c6ecd0854ac3f5f308ccd
Author: Andrew Dunstan <andrew@dunslane.net>
Date:   Sun Jan 26 17:03:13 2014 -0500

Provide for client-only installs with MSVC.

MauMau.

The client package contains all the binaries used to interact directly with the server (psql, pg_dump, pgbench) and the interface libraries (libpq, ecpg).

The documentation describes precisely how to set up an environment to compile PostgreSQL on Windows, so in short, here is the new command that you can use from src/tools/msvc in, for example, a Windows SDK command prompt:

install c:\install\to\path client

The "install" command will install everything by default if no keyword is specified. As a new behavior, the keyword "all" can be used to install everything, meaning that the following commands are equivalent:

install c:\install\to\path
install c:\install\to\path all

After the client installation, you will get the following things installed:

$ ls /c/install/to/path/bin/
clusterdb.exe   droplang.exe   oid2name.exe        pg_isready.exe      reindexdb.exe
createdb.exe    dropuser.exe   pg_basebackup.exe   pg_receivexlog.exe  vacuumdb.exe
createlang.exe  ecpg.exe       pg_config.exe       pg_restore.exe      vacuumlo.exe
createuser.exe  libpq.dll      pg_dump.exe         pgbench.exe
dropdb.exe      oid2name.exe   pg_dumpall.exe      psql.exe
$ ls /c/install/to/path/lib/
libecpg.dll  libecpg_compat.dll  libpgcommon.lib  libpgtypes.dll  libpq.dll  postgres.lib
libecpg.lib  libecpg_compat.lib  libpgport.lib    libpgtypes.lib  libpq.lib

This is going to simplify a bit more the life of Windows packagers who have up to now needed custom scripts to install client-side things only. So thanks MauMau!
