Channel: Planet PostgreSQL

David Z: How to run a specific regression test


1. Overview

I have been working on an internal project based on PostgreSQL for a while, and from time to time, I need to run some specific test cases to verify my changes. Here, I want to share a tip for running a specific regression TAP test quickly, especially when you are focusing on a particular bug and you know which test case can help verify the fix. A detailed document about the regression tests can be found at Running the Tests.

2. Regression test

PostgreSQL provides a comprehensive set of regression tests to verify the SQL implementation embedded in PostgreSQL as well as the extended capabilities of PostgreSQL. Whenever you make some changes, you should run these existing test cases to make sure your change doesn’t break any existing features. Besides these regression tests, some special features use a test framework called TAP tests, for example, kerberos, ssl, recovery, etc.

If you want to run these tests, you have to make sure the option --enable-tap-tests has been configured, for example:
./configure --prefix=$HOME/pgapp --enable-tap-tests --enable-debug CFLAGS="-g3 -O0 -fno-omit-frame-pointer"

You can run the TAP tests using either make check or make installcheck, but compared with the non-TAP tests, the difference is that these TAP tests will always start a test server even when you run make installcheck. Because of this difference, some tests may take longer than you expect, and even worse, if a test case fails in the middle then the entire test run stops, and your test cases may never get the chance to run. For example, I changed something related to the recovery features, and those changes are supposed to be tested by the test cases 021_row_visibility.pl and 025_stuck_on_old_timeline.pl, but whenever I run make check or make installcheck, it ends up with something like below.

t/001_stream_rep.pl .................. ok     
t/002_archiving.pl ................... ok   
t/003_recovery_targets.pl ............ ok   
t/004_timeline_switch.pl ............. ok   
t/005_replay_delay.pl ................ ok   
t/006_logical_decoding.pl ............ ok     
t/007_sync_rep.pl .................... ok     
t/008_fsm_truncation.pl .............. ok   
t/009_twophase.pl .................... ok     
t/010_logical_decoding_timelines.pl .. ok     
t/011_crash_recovery.pl .............. ok   
t/012_subtransactions.pl ............. ok     
t/013_crash_restart.pl ............... ok     
t/014_unlogged_reinit.pl ............. ok     
t/015_promotion_pages.pl ............. ok   
t/016_min_consistency.pl ............. ok   
t/017_shm.pl ......................... ok   
t/018_wal_optimize.pl ................ ok     
t/019_replslot_limit.pl .............. 11/20 Bailout called.  Further testing stopped:  pg_ctl start failed
FAILED--Further testing stopped: pg_ctl start failed
Makefile:23: recipe for target 'check' failed
make: *** [check] Error 255

Now, 019_replslot_limit.pl always fails in the middle, and the test cases that verify my changes never get the chance to run.

3. How to run a specific test?

To run a specific test case, the key is to use the PROVE_TESTS variable provided by PostgreSQL. Details can be found at TAP Tests. The PROVE_TESTS variable allows you to define a whitespace-separated list of paths to run the specified subset of tests instead of the default t/*.pl. For example, in the above case, you can run make check PROVE_TESTS='t/021_row_visibility.pl t/025_stuck_on_old_timeline.pl'. It will run these two test cases directly. The output looks something like below.

recovery$ make check PROVE_TESTS='t/021_row_visibility.pl t/025_stuck_on_old_timeline.pl'
make -C ../../../src/backend generated-headers
make[1]: Entering directory '/home/sandbox/sharedsm/src/backend'
make -C catalog distprep generated-header-symlinks
make[2]: Entering directory '/home/sandbox/sharedsm/src/backend/catalog'
make[2]: Nothing to be done for 'distprep'.
make[2]: Nothing to be done for 'generated-header-symlinks'.
make[2]: Leaving directory '/home/sandbox/sharedsm/src/backend/catalog'
make -C utils distprep generated-header-symlinks
make[2]: Entering directory '/home/sandbox/sharedsm/src/backend/utils'
make[2]: Nothing to be done for 'distprep'.
make[2]: Nothing to be done for 'generated-header-symlinks'.
make[2]: Leaving directory '/home/sandbox/sharedsm/src/backend/utils'
make[1]: Leaving directory '/home/sandbox/sharedsm/src/backend'
rm -rf '/home/sandbox/sharedsm'/tmp_install
/bin/mkdir -p '/home/sandbox/sharedsm'/tmp_install/log
make -C '../../..' DESTDIR='/home/sandbox/sharedsm'/tmp_install install >'/home/sandbox/sharedsm'/tmp_install/log/install.log 2>&1
make -j1 checkprep >>'/home/sandbox/sharedsm'/tmp_install/log/install.log 2>&1
rm -rf '/home/sandbox/sharedsm/src/test/recovery'/tmp_check
/bin/mkdir -p '/home/sandbox/sharedsm/src/test/recovery'/tmp_check
cd . && TESTDIR='/home/sandbox/sharedsm/src/test/recovery' PATH="/home/sandbox/sharedsm/tmp_install/home/sandbox/pgapp/bin:$PATH" LD_LIBRARY_PATH="/home/sandbox/sharedsm/tmp_install/home/sandbox/pgapp/lib:$LD_LIBRARY_PATH" PGPORT='65432' PG_REGRESS='/home/sandbox/sharedsm/src/test/recovery/../../../src/test/regress/pg_regress' /usr/bin/prove -I ../../../src/test/perl/ -I . t/021_row_visibility.pl t/025_stuck_on_old_timeline.pl
t/021_row_visibility.pl ……… ok
t/025_stuck_on_old_timeline.pl .. ok
All tests successful.
Files=2, Tests=11, 13 wallclock secs ( 0.01 usr 0.00 sys + 1.73 cusr 4.03 csys = 5.77 CPU)
Result: PASS
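
PROVE_TESTS can also be combined with the PROVE_FLAGS variable (described in the same TAP Tests documentation) if you want more detail from prove itself, for example something like:

make check PROVE_TESTS='t/021_row_visibility.pl' PROVE_FLAGS='--verbose --timer'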

Of course, if you know the makefile very well, you can also do it your own way. For example, by looking at the output, you can simply follow the steps below to achieve the same results.

rm -rf '/home/sandbox/sharedsm/src/test/recovery'/tmp_check

mkdir -p '/home/sandbox/sharedsm/src/test/recovery'/tmp_check

recovery$ cd . && TESTDIR='/home/sandbox/sharedsm/src/test/recovery' PATH="/home/sandbox/pgapp/bin:$PATH" PGPORT='65432' top_builddir='/home/sandbox/sharedsm/src/test/recovery/../../..' PG_REGRESS='/home/sandbox/sharedsm/src/test/recovery/../../../src/test/regress/pg_regress' /usr/bin/prove -I ../../../src/test/perl/ -I .  t/021_row_visibility.pl 
t/021_row_visibility.pl .. ok     
All tests successful.
Files=1, Tests=10,  5 wallclock secs ( 0.02 usr  0.00 sys +  0.81 cusr  1.42 csys =  2.25 CPU)
Result: PASS

4. Summary

In this blog, I explained how to run a specific test case by using the PROVE_TESTS variable for TAP tests. You can also run the tests manually to skip tests that either take too much time or may fail in the middle and block your test cases.

A software developer specializing in C/C++ programming with experience in hardware, firmware, software, database, network, and system architecture. Now working at HighGo Software Inc. as a senior PostgreSQL architect.

The post How to run a specific regression test appeared first on Highgo Software Inc..


Andreas 'ads' Scherbaum: Fabien Coelho

PostgreSQL Person of the Week Interview with Fabien Coelho: I’m French, born and raised in Paris over 50 years ago. I work in Fontainebleau and Paris, and live in the Centre Val de Loire region, along the Loire river.

Vincenzo Romano: A case involving Postgres and some carelessness


The case

I have been involved in a case where an important application has been behaving erratically with regard to execution time. This is the background.

The application is a customised installation of a very well-known analytics suite in use at a very large Dutch shipping and logistics company. It has been deployed in AWS with the backing storage provided by a rather large Postgres RDS in both the acceptance and production environments, with the former being a slimmed-down installation compared to the latter. The UI is web-based.
The company I work with is in charge of the cloud infrastructure, and that alone.

The particular issue happens when browsing a list of postal addresses. This function is used to browse and search among about 9 million postal addresses all around the country (the Netherlands, for the record). In my personal opinion and experience this is not a really large data set. At the very start, this function displays the first 100 addresses with the usual search boxes, paging widgets and a number that represents the overall count. You can click on an address and get an expanded view. For some unclear reason the problem has been handed to the company responsible for the infrastructure.

In acceptance the function seems to work quite well 95% of the time: the first page appears almost instantaneously, but 5 times out of 100 it can take up to two minutes (yes, 120 seconds!) before displaying any data. In production, counterintuitively, it’s the other way around: the startup takes a very long time to display the first page of data. Subsequent pages appear almost instantaneously in either environment.

The whole investigation was time-boxed to just 1 working day.

First ask for the logs!

As usual I tried to gather as many details as possible about the application itself. It’s a name I have heard and seen in almost all my previous working experiences over the past 10 years, but this is the very first time I have looked at it this deeply.
According to the customer, the specific function is written in Java (I think almost all of the application is Java) and is hosted under the Apache suite. There is a load balancer in front of a multi-availability-zone setup. Thus it’s far from an unusual setup.

Despite asking over and over, I had no chance to get to the application source code itself. So I focused on the one and only part I could freely access: the DB.
I went immediately to the configuration parameters to log query execution time. The aim is twofold: know how the application interacts with the database and know how long it takes to execute those queries. The friendly manual told me how. AWS RDS requires you to set up a parameter group, use it to configure the DB and reboot the instance to put it in use. So I did my setup:

log_min_duration_statement = 0 # (ms) Sets the minimum execution time above which statements will be logged.
log_statement = all # Sets the type of statements logged.
log_duration = 1 # (boolean) Logs the duration of each completed SQL statement.

Basically I wanted to know which queries were run, all of them, and how long they took from the DB perspective no matter the actual duration.

Of course, a few interesting things popped up. Let’s have a look.

The findings

First I had to figure out how the tables are named. I could have run my faithful PgAdmin tool, but a few searches with the vim editor over the log files were effective enough.
The main table used for those addresses is called ebx_pdl_address and to get the number of its rows the actual query is:

SELECT n_live_tup FROM pg_stat_all_tables WHERE relname = lower('ebx_pdl_address')

Nice. No count(*). Of course. Instead of counting the rows in the table, the application is asking the DB for an estimate of the “number of live rows” in that table. So we will get whatever the DB statistics collector thinks is the number of rows. This number can differ from the actual number. This is done because in Postgres the count aggregate function “requires effort proportional to the size of the table: PostgreSQL will need to scan either the entire table or the entirety of an index that includes all rows in the table“. The programmer decided that an estimate is more than enough. I cannot agree more with this decision, as the user won’t mind being off by even a few percent over 9+ million rows.

Soon after I found the query I was looking for.

SELECT
  t0.T_LAST_USER_ID, t0.T_CREATOR_ID, t0.T_CREATION_DATE, t0.T_LAST_WRITE,
  t0.general_addressid, t0.general_housenumber, t0.general_housenumberaddition,
  t0.general_postalcodearea_, t0.general_street_, t0.general_status,
  t0.general_addresstype, t0.general_addressableobjecttype,
  t0.general_parcelnumber, t0.general_parcelcode__addresstype,
  t0.general_parcelcode__parcelcode, t0.general_controlindicator,
  t0.general_letterboxnumber, t0.general_deliverypointnumber,
  t0.general_deliverynumber, t0.general_qualityletterbox,
  t0.general_additionalrouteflat, t0.general_additionalroutestaircase,
  t0.general_additionalrouteelevator, t0.general_additionalroutetype,
  t0.general_startdate, t0.general_enddate, t0.general_lastmutationdate,
  t0.general_versionstartdate, t0.general_cityidbag,
  t0.general_municipalityidbag, t0.general_publicspacebag, t0.general_pandidbag,
  t0.general_statuspandbag, t0.general_verblijfsobjectidbag,
  t0.general_statusverblijfsobjectbag, t0.general_numberdisplaybag,
  t0.general_statusnummeraanduidingbag, t0.general_gebruiksdoelbag,
  t0.general_bagindexpchnrht, t0.general_bagindexwplstreethnrht,
  t0.general_bagindexwplbagopenbareruimteidhnrht,
  t0.general_frominitialdataload, t0.general_changedindicator
FROM ebx_pdl_address t0
ORDER BY t0.general_addressid; -- This is a primary key

Please, spend a minute or two over it.

Basically, this query is pulling all 43 (why not 42?) columns from all 9+ million rows in that table. At once.
No LIMIT/OFFSET, no WHERE condition, no filtering JOIN.
The duration is a little more than 128,000 milliseconds. More than 2 minutes!
If a single row weighed a mere 256 bytes (a quarter of a KiB, 6 bytes per column on average), we would be transferring from the DB to the client more than 2 GiB of data, possibly over an SSL-encrypted channel! If we did that over a 1 Gbps Ethernet link, it would require no less than 20 seconds.
A quick check with PgAdmin confirmed that it is a real table (not a view) and that the column size is about 700 bytes.
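
As a back-of-the-envelope check of those numbers (the 256-byte row weight and the ~125 MB/s of usable 1 Gbps bandwidth are the assumptions from above), the whole estimate fits in one psql query:

SELECT 9e6 * 256                                        AS bytes_total,
       round(((9e6 * 256) / 2^30)::numeric, 2)          AS gib_total,
       round(((9e6 * 256) / (125 * 10^6))::numeric, 1)  AS seconds_at_1gbps;
-- roughly 2.3 GB, i.e. about 2.15 GiB, and about 18 seconds of pure wire time before any protocol overhead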

Now, let’s think about the software architecture and make some (possibly reasonable) assumptions.
We have an RDS talking to a Java application triggered by a web server. In front of that there is a web load balancer and, finally, the client.
If the paged view of the data set is managed by the Java application itself, then the data is copied over the network only once and cached in the application, where the paging happens.
If the paged view is managed by the browser, then there would possibly be two more data copies over the network. I am assuming that internal communication within the Apache stack is shared-memory based, with no copy at all.

I tried to run the same query from a client sitting in the same account as the RDS via the psql command line tool. At best that query required 45 seconds to run. At worst I got the client aborting due to insufficient memory.

Actual whiteboard drawing used during the discussion with the customer

I got flabbergasted! I really could not believe my eyes.


The erratic execution delay could have been caused by some application caching. Or even browser caching, for all I could know. For sure, copying even a single GiB of data from the DB to the application seems a naïve solution, to say the least.

Moreover, I found out that in acceptance there were very few updates to that table, while in production the table was receiving frequent updates. If my caching assumption was true, frequent updates would lead to frequent cache invalidations and new queries being run.


I briefly discussed this with my colleague involved in this story and wrapped up a conclusion presentation with the above picture attached to it.

Conclusions to the customer

These are the questions the customer asked and the replies I gave during the final session.

Q. Can this problem be solved with a larger RDS instance?
A. From the data I have got I cannot suggest a larger RDS instance.

Q. Is there any Postgres setting we can tune in order to get better performance?
A. From the data I have got I cannot suggest any configuration tuning that would lead to a performance increase better than a few percentage points.

Q. Is there any change that can be applied to the indexes in order to mitigate the problem?
A. No, there is none. The only index in use is the one linked to the primary key (general_addressid).

Q. Do you think we can fix this by switching to <very famous and expensive SQL server>?
A. From the data I have got I cannot suggest such a change as the issue seems to be software architecture-related.

Q. How can we possibly fix this issue?
A. The minimal changes needed to try to fix the issue are to change both the application logic and the query design. The fix requires some software rework and an internal application logic change.
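
For illustration only (I could not verify this against the application), here is a sketch of what a paged query could look like: keyset pagination on the primary key, fetching only the columns the list view actually needs. The 100-row page size and the :last_seen_id placeholder are assumptions.

-- first page: only the columns the list view shows, one page at a time
SELECT t0.general_addressid, t0.general_street_, t0.general_housenumber,
       t0.general_postalcodearea_
FROM ebx_pdl_address t0
ORDER BY t0.general_addressid
LIMIT 100;

-- subsequent pages: resume right after the last id shown on the previous page
SELECT t0.general_addressid, t0.general_street_, t0.general_housenumber,
       t0.general_postalcodearea_
FROM ebx_pdl_address t0
WHERE t0.general_addressid > :last_seen_id
ORDER BY t0.general_addressid
LIMIT 100;

Each page then reads a handful of index pages instead of streaming the whole 9+ million row table to the application.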

My own conclusions

Acceptance tests should be completely run with positive results before moving to production.

Acceptance tests should include performance tests with a clear and agreed definition of “expected performance and behavior“.

Naïve and careless application and DB design is hard to fix.

Long live to Postgres (and its logs)!

Ibrar Ahmed: PostgreSQL 14 Database Monitoring and Logging Enhancements

PostgreSQL-14 Database Monitoring and Logging Enhancements

PostgreSQL-14 was released in September 2021, and it contained many performance improvements and feature enhancements, including some features from a monitoring perspective. As we know, monitoring is the key element of any database management system, and PostgreSQL keeps updating and enhancing the monitoring capabilities. Here are some key ones in PostgreSQL-14.

Query Identifier

A query identifier is used to identify a query and can be cross-referenced between extensions. Prior to PostgreSQL-14, extensions used their own algorithm to calculate the query_id. Usually the same algorithm is used, but any extension can use its own. Now, PostgreSQL-14 optionally provides a query_id computed in the core, and PostgreSQL-14’s monitoring extensions and utilities like pg_stat_activity, EXPLAIN, and pg_stat_statements use this query_id instead of calculating their own. This query_id can also be seen in csvlog, after specifying it in log_line_prefix. From a user perspective, there are two benefits of this feature.

  • All the utilities/extensions will use the same query_id calculated by core, which makes it easy to cross-reference the query_id. Previously, all the utilities/extensions needed to use the same algorithm in their code to achieve this capability.
  • The second benefit is that extensions/utilities can use the already-calculated query_id and don’t need to compute it again, which is a performance benefit.

PostgreSQL introduces a new GUC configuration parameter compute_query_id to enable/disable this feature. The default is auto; it can be turned on/off in the postgresql.conf file, or using the SET command.

  • pg_stat_activity

SET compute_query_id = off;

SELECT datname, query, query_id FROM pg_stat_activity;
 datname  |                                 query                                 | query_id 
----------+-----------------------------------------------------------------------+----------
 postgres | select datname, query, query_id from pg_stat_activity;                |         
 postgres | UPDATE pgbench_branches SET bbalance = bbalance + 2361 WHERE bid = 1; |

SET compute_query_id = on;

SELECT datname, query, query_id FROM pg_stat_activity;
 datname  |                                 query                                 |      query_id       
----------+-----------------------------------------------------------------------+---------------------
 postgres | select datname, query, query_id from pg_stat_activity;                |  846165942585941982
 postgres | UPDATE pgbench_tellers SET tbalance = tbalance + 3001 WHERE tid = 44; | 3354982309855590749

  • Log

In previous versions, there was no mechanism to compute the query_id in the server core. The query_id is especially useful in the log files. To enable it there, we need to configure the log_line_prefix configuration parameter. The “%Q” option is added to show the query_id; here is an example.

log_line_prefix = 'query_id = [%Q] -> '

query_id = [0] -> LOG:  statement: CREATE PROCEDURE ptestx(OUT a int) LANGUAGE SQL AS $$ INSERT INTO cp_test VALUES (1, 'a') $$;
query_id = [-6788509697256188685] -> ERROR:  return type mismatch in function declared to return record
query_id = [-6788509697256188685] -> DETAIL:  Function's final statement must be SELECT or INSERT/UPDATE/DELETE RETURNING.
query_id = [-6788509697256188685] -> CONTEXT:  SQL function "ptestx"
query_id = [-6788509697256188685] -> STATEMENT:  CREATE PROCEDURE ptestx(OUT a int) LANGUAGE SQL AS $$ INSERT INTO cp_test VALUES (1, 'a') $$;

  • Explain

The EXPLAIN VERBOSE will show the query_id if compute_query_id is true.

SET compute_query_id = off;

EXPLAIN VERBOSE SELECT * FROM foo;
                          QUERY PLAN                          
--------------------------------------------------------------

 Seq Scan on public.foo  (cost=0.00..15.01 rows=1001 width=4)
   Output: a
(2 rows)

SET compute_query_id = on;

EXPLAIN VERBOSE SELECT * FROM foo;
                          QUERY PLAN                          
--------------------------------------------------------------
 Seq Scan on public.foo  (cost=0.00..15.01 rows=1001 width=4)
   Output: a
 Query Identifier: 3480779799680626233
(3 rows)

autovacuum and auto-analyze Logging Enhancements

PostgreSQL-14 improves the logging of autovacuum and auto-analyze. Now we can see the I/O timings in the log, showing how much time has been spent reading and writing.

automatic vacuum of table "postgres.pg_catalog.pg_depend": index scans: 1
pages: 0 removed, 67 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 89 removed, 8873 remain, 0 are dead but not yet removable, oldest xmin: 210871
index scan needed: 2 pages from table (2.99% of total) had 341 dead item identifiers removed
index "pg_depend_depender_index": pages: 39 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_depend_reference_index": pages: 41 in total, 0 newly deleted, 0 currently deleted, 0 reusable

I/O timings: read: 44.254 ms, write: 0.531 ms

avg read rate: 13.191 MB/s, avg write rate: 8.794 MB/s
buffer usage: 167 hits, 126 misses, 84 dirtied
WAL usage: 85 records, 15 full page images, 78064 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.07 s

These logs are only available if track_io_timing is enabled.
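
For reference, a minimal configuration sketch in postgresql.conf style (values are illustrative): track_io_timing enables the I/O timings line shown above, and log_autovacuum_min_duration controls which autovacuum/auto-analyze runs are logged at all, since that logging is disabled by default in PostgreSQL-14.

track_io_timing = on                # collect read/write timings, reported in the "I/O timings" line
log_autovacuum_min_duration = 0     # (ms) log every autovacuum/auto-analyze run; -1 (the default) disables this logging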

Connection Logging

PostgreSQL already logs connections/disconnections if log_connections/log_disconnections is on. In addition, PostgreSQL-14 now also logs the actual username supplied by the client. When some external authentication is used and a mapping is defined in pg_ident.conf, it can become hard to identify the actual user name. Before PostgreSQL-14, you could only see the mapped user instead of the actual user.

pg_ident.conf

# MAPNAME       SYSTEM-USERNAME         PG-USERNAME

pg              vagrant                 postgres

pg_hba.conf

# TYPE  DATABASE        USER            ADDRESS                 METHOD
# "local" is for Unix domain socket connections only
local   all             all                                     peer map=pg

Before PostgreSQL-14

LOG:  database system was shut down at 2021-11-19 11:24:30 UTC
LOG:  database system is ready to accept connections
LOG:  connection received: host=[local]
LOG:  connection authorized: user=postgres database=postgres application_name=psql

PostgreSQL-14

LOG:  database system is ready to accept connections
LOG:  connection received: host=[local]
LOG:  connection authenticated: identity="vagrant" method=peer (/usr/local/pgsql.14/bin/data/pg_hba.conf:89)
LOG:  connection authorized: user=postgres database=postgres application_name=psql

Conclusion

Every major PostgreSQL release carries significant enhancements, and PostgreSQL-14 was no different.

Monitoring is a key feature of any DBMS, and PostgreSQL keeps upgrading its logging and monitoring capabilities. With these newly added features, you have more insight into connections, you can easily track queries and observe performance, and you can identify how much time the vacuum process spends in read/write operations. This can significantly help you configure vacuum parameters better.


As more companies look at migrating away from Oracle or implementing new databases alongside their applications, PostgreSQL is often the best option for those who want to run on open source databases.

Read Our New White Paper:

Why Customers Choose Percona for PostgreSQL

Franck Pachot: Text Search example with the OMDB sample database


This post shows some basic examples of one of the greatest features of PostgreSQL: text search. In standard SQL, we can use LIKE, SIMILAR TO and regular expressions. But general text search cannot be optimized with simple B-Tree indexes on the column value. Text contains words, and indexing the text as a whole is not sufficient. Fortunately, PostgreSQL provides many index types, and one of them is GIN - Generalized Inverted Index. We can index the words, automatically, with functions to extract them as "tsvector" - Text Search vectors.
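
For example, here is a minimal illustration of what such a vector looks like (the 'simple' configuration keeps every word as-is, without stemming or stop-word removal):

SELECT to_tsvector('simple', 'The quick brown fox jumps over the lazy dog');
-- 'brown':3 'dog':9 'fox':4 'jumps':5 'lazy':8 'over':6 'quick':2 'the':1,7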

If you are an advanced user of PostgreSQL, you will probably not learn new things here, except if you want to see how I use it, with views and a stored function to encapsulate the functions. If you are a user of Oracle or SQL Server, you know the idea but may be surprised by how easy it is to use in an Open Source database. If you are a user of ElasticSearch, you may see that for simple searches, SQL databases can provide this without an additional service.

My goal here is to show that we can use the same on the latest version of YugabyteDB (I'm using 2.11 here). YugabyteDB is a distributed SQL database that reuses the PostgreSQL query layer, which means that many features come without additional effort. However, the distributed storage is different from monolithic Postgres, using LSM Trees instead of B-Trees and heap tables. The YugabyteDB YBGIN index is similar to the PostgreSQL GIN index, but implemented on top of LSM Tree indexes.

PostgreSQL: HEAP, BTREE and GIN

In PostgreSQL, here is how you define an HEAP table and a GIN index:

postgres=# create table demo
           (id bigint primary key, description text)
           USING HEAP;
CREATE TABLE

postgres=# create index demo_index on demo
           ( length(description) );
CREATE INDEX

postgres=# create index demo_gin on demo
           USING GIN
           ( (to_tsvector('simple',description)) );
CREATE INDEX

postgres=# select relname,reltype,amname 
           from pg_class left outer join pg_am
           on pg_class.relam=pg_am.oid
           where relname like 'demo%';

  relname   | reltype | amname
------------+---------+--------
 demo       |   16918 | heap
 demo_gin   |       0 | gin
 demo_index |       0 | btree
 demo_pkey  |       0 | btree
(4 rows)

The storage type defined with USING is visible as the Access Method, which is the name of the extensibility layer for different storage implementations. The default for PostgreSQL is HEAP tables and BTREE indexes, and GIN can be used for text search.

In YugabyteDB the default is LSM Tree for the indexes and table (which is stored clustered on the primary key):

yugabyte=# create table demo
           (id bigint primary key, description text)
           ;
CREATE TABLE

yugabyte=# create index demo_index on demo
           ( length(description) );
CREATE INDEX

yugabyte=# create index demo_gin on demo
           USING GIN
           ( (to_tsvector('simple',description)) );
NOTICE:  replacing access method "gin" with "ybgin"
CREATE INDEX

yugabyte=# select relname,reltype,amname 
           from pg_class left outer join pg_am
           on pg_class.relam=pg_am.oid
           where relname like 'demo%';

  relname   | reltype | amname
------------+---------+--------
 demo       |   16805 |
 demo_gin   |       0 | ybgin
 demo_index |       0 | lsm
 demo_pkey  |       0 | lsm
(4 rows)

A few comments on the differences with PostgreSQL.

  • there's no USING clause for the table because I'm using YB-2.11 which is PG11 compatible and table access methods came in PG12. All non-temporary tables in YugabyteDB use the distributed storage (LSM Tree).
  • the primary key, which is physically the same as the table, shows LSM.
  • regular secondary indexes are also LSM Trees
  • the USING GIN clause is transformed to USING YBGIN, which is the YugabyteDB implementation of it.

OMDB sample database

I'll load a sample database with many text columns: OMDB (open media database) - a free database for film media.

The https://github.com/credativ/omdb-postgresql project has a procedure to load it into PostgreSQL. I can use the same to load into YugabyteDB, but this is not optimal. It is better to define the PRIMARY KEY in the CREATE TABLE statement rather than with an ALTER TABLE later. What I did was load into PostgreSQL in order to export with pg_dump:

git clone https://github.com/credativ/omdb-postgresql.git
cd omdb-postgresql
./download
./import
pg_dump -f omdb.sql omdb

I have a quick script to move the PRIMARY KEY declaration into the CREATE TABLE:

awk '
/^ALTER TABLE ONLY/{last_alter_table=$NF}
/^ *ADD CONSTRAINT .* PRIMARY KEY /{sub(/ADD /,"");sub(/;$/,"");pk[last_alter_table]=$0",";$0=$0"\\r"}
NR > FNR && /^CREATE TABLE/{ print $0,pk[$3] > "omdb-pk.sql" ; next} NR > FNR { print > "omdb-pk.sql" }
' omdb.sql omdb.sql

A better solution would be using yb_dump from the YugabyteDB installation, but I want to make this post simple and the same with PostgreSQL and YugabyteDB.

Google-like Text Search Queries

To simplify queries, I create the following view that joins movies with movie_abstracts_en and adds the text search vector of words from the abstract (to_tsvector('simple',abstract)):

create or replace view movies_search as 
 select *
 from ( 
    select id as movie_id, name as movie_name from movies
 ) movies 
 natural join 
 (
    select movie_id, abstract as movie_abstract,
     to_tsvector('simple',abstract) as movie_abstract_ts
    from movie_abstracts_en
 ) movie_abstracts;

I also create a function to encapsulate the call to websearch_to_tsquery in order to query it with a google-like syntax:

create or replace function find_movie(text)
returns setof movies_search as $sql$
 select * from movies_search 
 where websearch_to_tsquery('simple',$1) 
 @@ movie_abstract_ts;
$sql$ language sql;

Here is a simple example of usage:

yugabyte=#  select * from find_movie(
            'Luke and Leia and "George Lucas"
            ');


movie_id|movie_name                                    |movie_abstract                                                                                                                                                                                                                                                 |movie_abstract_ts                                                                                                                                                                                                                                              |
-------------+----------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    1891|Star Wars: Episode V - The Empire Strikes Back|The Empire Strikes Back is considered the most morally and emotionally complex of the original Star Wars trilogy, continuing creator George Lucas's epic saga where Star Wars: Episode IV - A New Hope left off. Masterful storytelling weaves together multipl|'a':31 'against':46 'and':10,48,59 'archetypal':41 'as':50 'attempts':53 'back':4 'capture':55 'complex':12 'considered':6 'continuing':19 'creator':20 'desperately':52 'emotionally':11 'empire':2 'epic':24 'episode':29 'for':57 'george':21 'han':47 'he':|
      10|Star Wars                                     |A series of six films from the Director, Screenwriter and Producer George Lucas. Luke Skywalker, Princes Leia, Darth Vader, C3PO, R2D2 and many other characters from the film are now house hold names from one of the most successful film projects of all ti|'a':1 'all':43 'and':10,22 'are':29 'c3po':20 'characters':25 'darth':18 'director':8 'film':28,40 'films':5 'from':6,26,34 'george':12 'hold':32 'house':31 'leia':17 'lucas':13 'luke':14 'many':23 'most':38 'names':33 'now':30 'of':3,36,42 'one':35 'othe|
      11|Star Wars: Episode IV – A New Hope            |A New Hope was the first Star Wars film from the director, screenwriter, and producer George Lucas, although it is the fourth episode in the series of six. Luke Skywalker, Princes Leia, Darth Vader, C3PO, R2D2 and many other characters from the film are n|'a':1 'all':59 'although':18 'and':14,37 'are':44 'c3po':35 'characters':40 'darth':33 'director':12 'episode':23 'film':9,43,56 'first':6 'fourth':22 'from':10,41,50 'george':16 'hold':48 'hope':3 'house':47 'house-hold-names':46 'in':24 'is':20 'it':19 |

You can see in the last column the text search vector of words, with their positions, which is used by the @@ operator. And the text search query generated from the google-like syntax is:

yugabyte=#  select websearch_to_tsquery('simple',
            'Luke and Leia and "George Lucas"'
            );

                  websearch_to_tsquery
-------------------------------------------------------------
 'luke' & 'and' & 'leia' & 'and' & 'george' <-> 'lucas'
(1 row)

This Text Search syntax is powerful (<-> is for consecutive words, and you can use <3> to accept them spaced by two words). The goal of this post is not to go into all the details, as they are documented in many places. For common searches, the websearch_to_tsquery syntax is probably easier.
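
As a quick sketch of the distance operator, using throw-away text rather than the OMDB data: <N> requires the second lexeme to appear exactly N positions after the first, so <3> leaves room for two words in between.

SELECT to_tsvector('simple', 'one two three four') @@ 'one <3> four'::tsquery;  -- true: "four" is 3 positions after "one"
SELECT to_tsvector('simple', 'one two three four') @@ 'one <-> four'::tsquery;  -- false: the two words are not adjacent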

The search looks good, except that the underlying query has to scan the whole table:

yugabyte=#  explain (analyze, verbose) 
            select * from movies_search  where
            websearch_to_tsquery('simple',
            'Luke and Leia and "George Lucas"'
            ) @@ movie_abstract_ts;

                                                                                                                                                                                                                                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..716.39 rows=1000 width=104) (actual time=20.359..79.051 rows=3 loops=1)
   Output: movies.id, movies.name, movie_abstracts_en.abstract, to_tsvector('simple'::regconfig, movie_abstracts_en.abstract)
   Inner Unique: true
   ->  Seq Scan on public.movie_abstracts_en  (cost=0.00..352.50 rows=1000 width=40) (actual time=19.203..76.525 rows=3 loops=1)
         Output: movie_abstracts_en.movie_id, movie_abstracts_en.abstract
         Filter: ('''luke'' & ''and'' & ''leia'' & ''and'' & ''george'' <-> ''lucas'''::tsquery @@ to_tsvector('simple'::regconfig, movie_abstracts_en.abstract))
         Rows Removed by Filter: 2683
   ->  Index Scan using movies_pkey on public.movies  (cost=0.00..0.11 rows=1 width=40) (actual time=0.766..0.766 rows=1 loops=3)
         Output: movies.id, movies.name, movies.parent_id, movies.date, movies.series_id, movies.kind, movies.runtime, movies.budget, movies.revenue, movies.homepage, movies.vote_average, movies.votes_count
         Index Cond: (movies.id = movie_abstracts_en.movie_id)
 Planning Time: 2.722 ms
 Execution Time: 79.153 ms
(12 rows)

The ::tsquery @@ to_tsvector condition is a Filter after Seq Scan. A regular index would not help. This is where GIN indexes come into play as they can index the array of words.

YBGIN, the GIN index in YugabyteDB

I can create an index on the function behind "movie_abstract_ts". The GIN index will have entries for each array element, which are words when the text is parsed by to_tsvector().

yugabyte=#  create index movie_abstracts_en_ts_vector 
            on movie_abstracts_en 
            using ybgin ( 
            ( to_tsvector('pg_catalog.simple',abstract) ) 
            );
CREATE INDEX

Now the same EXPLAIN shows the filter in Index Cond:

                                                                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=24.00..395.90 rows=1000 width=104) (actual time=3.729..5.030 rows=3 loops=1)
   Output: movies.id, movies.name, movie_abstracts_en.abstract, to_tsvector('simple'::regconfig, movie_abstracts_en.abstract)
   Inner Unique: true
   ->  Index Scan using movie_abstracts_en_ts_vector on public.movie_abstracts_en  (cost=24.00..32.01 rows=1000 width=40) (actual time=3.215..3.486 rows=3 loops=1)
         Output: movie_abstracts_en.movie_id, movie_abstracts_en.abstract
         Index Cond: ('''luke'' & ''and'' & ''leia'' & ''and'' & ''george'' <-> ''lucas'''::tsquery @@ to_tsvector('simple'::regconfig, movie_abstracts_en.abstract))
         Rows Removed by Index Recheck: 7
   ->  Index Scan using movies_pkey on public.movies  (cost=0.00..0.11 rows=1 width=40) (actual time=0.453..0.453 rows=1 loops=3)
         Output: movies.id, movies.name, movies.parent_id, movies.date, movies.series_id, movies.kind, movies.runtime, movies.budget, movies.revenue, movies.homepage, movies.vote_average, movies.votes_count
         Index Cond: (movies.id = movie_abstracts_en.movie_id)
 Planning Time: 4.134 ms
 Execution Time: 5.290 ms
(12 rows)

The GIN index is used to filter out most of the non-matching rows, but there can be some false positives. Here, of the 2686 rows in movie_abstracts_en, 10 have been selected by the index and filtered further (Rows Removed by Index Recheck: 7) down to the resulting 3 (rows=3).

This was a quick introduction to Text Search in PostgreSQL and YugabyteDB. Many OLTP applications require some free-text search on a few text columns. This is very simple to do within a PostgreSQL-compatible database. And with a distributed SQL database like YugabyteDB, it scales out, like the specialized text search services, because the table and the index are sharded and replicated across multiple nodes.

Yugo Nagata: Transition Tables in Incremental View Maintenance (Part II): Multiple Tables Modification case


Introduction

In a previous post, I explained how we use transition tables in our implementation of Incremental View Maintenance (IVM) on PostgreSQL. Transition tables are a feature of AFTER triggers which allows trigger functions to refer to the changes of a table that occurred in a statement. We are using transition tables in order to extract the table changes needed to calculate the changes to be applied to views.
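
As a refresher on the underlying PostgreSQL feature (not our IVM code itself), here is a minimal sketch of a statement-level AFTER trigger that exposes a transition table; the table and function names are made up for illustration:

CREATE TABLE r (i int, x text);

CREATE FUNCTION r_count_inserted() RETURNS trigger AS $$
DECLARE
  n bigint;
BEGIN
  -- new_r is the transition table: every row inserted by the triggering statement
  SELECT count(*) INTO n FROM new_r;
  RAISE NOTICE 'inserted % row(s) into r', n;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER r_after_insert
  AFTER INSERT ON r
  REFERENCING NEW TABLE AS new_r
  FOR EACH STATEMENT
  EXECUTE FUNCTION r_count_inserted();

A single statement that inserts many rows fires this trigger once, and all of the inserted rows are visible through new_r, which is the mechanism our IVM triggers build on.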

In this article I describe a more complicated situation, specifically how we handle transition tables when multiple tables are modified in a statement.

Single Table Modification

In a case where a single table is modified in a statement, the view maintenance process is simple. For example, suppose we have three tables R, S, and T. We also define a materialized view V = R ⨝ S ⨝ T that joins these tables as below:

SELECT x,y,z FROM R,S,T WHERE R.i=S.i AND S.j=T.j;

Then, suppose that table R was modified in a statement. This operation can be written as R ← R ∸ ∇R ⊎ ΔR, where ∇R is a bag of tuples deleted from R, and ΔR is a bag of tuples inserted into R in this statement. In this case, the changes are calculated as ∇V = ∇R ⨝ S ⨝ T and ΔV = ΔR ⨝ S ⨝ T, and we can update the view as V ← V ∸ ∇V ⊎ ΔV. The SQL representation of these calculations is as follows:

-- ∇V: tuples to be deleted from the view 
SELECT x,y,z FROM R_old,S,T WHERE R_old.i=S.i AND S.j=T.j;
-- ΔV: tuples to be inserted into the view
SELECT x,y,z FROM R_new,S,T WHERE R_new.i=S.i AND S.j=T.j;

where R_old and R_new are transition tables corresponding to ∇R and ΔR, respectively.

Multiple Tables Modification

Now, let’s see cases where multiple tables are modified in a statement. You can observe this when you use modifying CTEs (WITH clauses), like:

WITH i1 AS (INSERT INTO R VALUES (1, 10) RETURNING 1),
     i2 AS (INSERT INTO S VALUES (1, 100) RETURNING 1)
SELECT;

In addition, multiple tables can be updated when you use triggers or foreign key constraints.

Pre-Update State of Tables

At that time, we need the state of the tables before the modification. For example, when some tuples ΔR, ΔS, and ΔT are inserted into R, S, and T respectively, the tuples that will be inserted into the view are calculated by the following three queries:

SELECT x,y,z FROM R_new,S_pre,T_pre WHERE R_new.i=S_pre.i AND S_pre.j=T_pre.j;
SELECT x,y,z FROM R,S_new,T_pre WHERE R.i=S_new.i AND S_new.j=T_pre.j;
SELECT x,y,z FROM R,S,T_new WHERE R.i=S.i AND S.j=T_new.j;

where R_new, S_new, and T_new are transition tables corresponding to ΔR, ΔS, and ΔT respectively. S_pre and T_pre are the table states before the modification, and R, S are the current states of the tables, that is, after the modification.

In our implementation, the “pre-update state of a table” is calculated by using some system columns. Specifically, tuples inserted into the table are filtered out by the cmin/xmin system columns. Also, tuples deleted from the table are put back by appending the tuples stored in the old transition table.

Collecting Multiple Transition Tables

As shown above, we need as many transition tables as there are table modifications for view maintenance. Therefore, the transition tables for each modification are collected in each AFTER trigger function call. Then, the view maintenance is performed in the last call of the trigger.

However, in the original PostgreSQL, the lifespan of transition tables is not long enough to preserve all transition tables during multiple modifications. Therefore, in order to implement the IVM feature, I modified the trigger code in PostgreSQL to allow prolonging a transition table’s lifespan and prevent it from being freed from memory too early.

Summary

In this article, I explained how transition tables are used when multiple tables are modified simultaneously in our IVM implementation. Transition tables are a useful feature for extracting table changes, but when we wanted to use them to implement IVM, we had to devise methods for calculating the pre-update state and prolonging the transition tables’ lifespan. If there were use cases other than IVM where these transition table features are useful, implementing them in PostgreSQL core might make sense. However, we don’t have a good idea about use cases other than IVM for now, so they are just for IVM.

Robert Haas: Collation Stability

When PostgreSQL needs to sort strings, it relies on either the operating system (by default) or the ICU collation library (if your PostgreSQL has been built with support for ICU and you have chosen to use an ICU-based collation) to tell it in what order the strings ought to be sorted. Unfortunately, operating system behaviors are confusing and inconsistent, and they change relatively frequently for reasons that most people can't understand. That's a problem for PostgreSQL users, especially PostgreSQL users who create indexes on text columns. The first step in building a btree index is to sort the data, and if this sort order differs from the one used for later index lookups, data that is actually present in the index may not be found, and your queries may return wrong answers. Read more »

Hans-Juergen Schoenig: Primary Keys vs. UNIQUE Constraints in PostgreSQL


Most of my readers will know about primary keys and all kinds of table constraints. However, only a few of you may have ever thought about the difference between a primary key and a UNIQUE constraint. Isn’t it all just the same? In both cases, PostgreSQL will create an index that avoids duplicate entries. So what is the difference? Let’s dig in and find out…

What primary keys and UNIQUE constraints do

The following example shows both a primary key and a unique constraint:

test=# CREATE TABLE t_sample (a int PRIMARY KEY, b int UNIQUE);

CREATE TABLE

test=# \d t_sample

            Table "public.t_sample"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           | not null | 
 b      | integer |           |          | 

Indexes:
    "t_sample_pkey" PRIMARY KEY, btree (a)
    "t_sample_b_key" UNIQUE CONSTRAINT, btree (b)

The really important observation is that both features make PostgreSQL create an index. This is important because people often use additional indexes on primary keys or unique columns. These additional indexes are not only unnecessary, but actually counterproductive.

The key to success: NULL handling

What makes a primary key different from a unique index is the way NULL entries are handled. Let’s take a look at a simple example:

test=# INSERT INTO t_sample VALUES (1, NULL);

INSERT 0 1

The example above works perfectly. PostgreSQL will accept the NULL value for the second column. As long as the primary key contains a unique value, we are OK. However, if that changes, then an error will occur:

test=# INSERT INTO t_sample VALUES (NULL, 2);

ERROR:  null value in column "a" of relation "t_sample" violates not-null constraint
DETAIL:  Failing row contains (null, 2).

This is actually the single biggest difference between these two types of constraints. Keep that in mind.
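
A unique constraint, on the other hand, accepts any number of NULL entries, because PostgreSQL does not treat two NULL values as equal. Here is a minimal check, continuing the example above (the output is sketched from that behavior, not copied from the original session):

test=# INSERT INTO t_sample VALUES (2, NULL);

INSERT 0 1

test=# INSERT INTO t_sample VALUES (3, NULL);

INSERT 0 1

Both rows are accepted even though column b carries a unique constraint: the repeated NULLs do not count as duplicates.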

Using foreign keys

The next logical question which arises is: What does that mean for foreign keys? Does it make a difference? Can we reference primary keys as well as unique constraints?

The simple answer is yes:

test=# CREATE TABLE t_fk_1 (
id  serial  PRIMARY KEY, 
aid  int  REFERENCES  t_sample (a)
); 

CREATE TABLE

test=# CREATE TABLE t_fk_2 (
id  serial  PRIMARY KEY, 
bid  int  REFERENCES t_sample (b)
); 

CREATE TABLE

test=# \d t_fk_1

                            Table "public.t_fk_1"
 Column |  Type   | Collation | Nullable |              Default               
--------+---------+-----------+----------+------------------------------------
 id     | integer |           | not null | nextval('t_fk_1_id_seq'::regclass)
 aid    | integer |           |          | 

Indexes:
    "t_fk_1_pkey" PRIMARY KEY, btree (id)

Foreign-key constraints:
    "t_fk_1_aid_fkey" FOREIGN KEY (aid) REFERENCES t_sample(a)

test=# \d t_fk_2

                            Table "public.t_fk_2"
 Column |  Type   | Collation | Nullable |              Default               
--------+---------+-----------+----------+------------------------------------
 id     | integer |           | not null | nextval('t_fk_2_id_seq'::regclass)
 bid    | integer |           |          | 

Indexes:
    "t_fk_2_pkey" PRIMARY KEY, btree (id)

Foreign-key constraints:
    "t_fk_2_bid_fkey" FOREIGN KEY (bid) REFERENCES t_sample(b)

It’s perfectly acceptable to reference a unique column containing NULL entries. In other words: we can reference primary keys as well as unique constraints equally – there are absolutely no differences to worry about.

If you want to know more about NULL in general, check out my post about NULL values in PostgreSQL.

Finally…

Primary keys and unique constraints are not only important from a logical perspective, they also matter from a database-performance point of view. Indexing in general can have a significant impact on performance. This is true for read as well as write transactions. If you want to ensure good performance, and if you want to read something about PostgreSQL performance right now, check out our blog.

The post Primary Keys vs. UNIQUE Constraints in PostgreSQL appeared first on CYBERTEC.


Miranda Auhl: PostgreSQL vs Python for data cleaning: A guide


Introduction

During analysis, you rarely - if ever - get to go directly from evaluating data to transforming and analyzing it. Sometimes to properly evaluate your data, you may need to do some pre-cleaning before you get to the main data cleaning, and that’s a lot of cleaning! In order to accomplish all this work, you may use Excel, R, or Python, but are these the best tools for data cleaning tasks?

In this blog post, I explore some classic data cleaning scenarios and show how you can perform them directly within your database using TimescaleDB and PostgreSQL, replacing the tasks that you may have done in Excel, R, or Python. TimescaleDB and PostgreSQL cannot replace these tools entirely, but they can help your data munging/cleaning tasks be more efficient and, in turn, let Excel, R, and Python shine where they do best: in visualizations, modeling, and machine learning.  

Cleaning is a very important part of the analysis process and generally can be the most grueling from my experience! By cleaning data directly within my database, I am able to perform a lot of my cleaning tasks one time rather than repetitively within a script, saving me considerable time in the long run.

A recap of the data analysis process

I began this series of posts on data analysis by presenting the following summary of the analysis process:

Data Analysis Lifecycle: Evaluate -> Clean -> Transform -> Model

The first three steps of the analysis lifecycle (evaluate, clean, transform) comprise the “data munging” stages of analysis. Historically, I have done my data munging and modeling all within Python or R, these being excellent options for analysis. However, once I was introduced to PostgreSQL and TimescaleDB, I found how efficient and fast it was to do my data munging directly within my database. In my previous post, I focused on showing data evaluation techniques and how you can replace tasks previously done in Python with PostgreSQL and TimescaleDB code. I now want to move on to the second step, data cleaning. Cleaning may not be the most glamorous step in the analysis process, but it is absolutely crucial to creating accurate and meaningful models.

As I mentioned in my last post, my first job out of college was at an energy and sustainability solutions company that focused on monitoring all different kinds of utility usage - such as electricity, water, sewage, you name it - to figure out how our clients’ buildings could be more efficient. My role at this company was to perform data analysis and business intelligence tasks.

Throughout my time in this job, I got the chance to use many popular data analysis tools including Excel, R, and Python. But once I tried using a database to perform my data munging tasks - specifically PostgreSQL and TimescaleDB - I realized how efficient and straightforward analysis, and particularly cleaning tasks, could be when done directly in a database.

Before using a database for data cleaning tasks, I would often find either columns or values that needed to be edited. I would pull the raw data from a CSV file or database, then make any adjustments to this data within my Python script. This meant that every time I ran my Python script, I would have to wait for my machine to spend computational time setting up and cleaning my data, so I lost time with every run of the script. Additionally, if I wanted to share cleaned data with colleagues, I would have to run the script or pass it along to them to run. This extra computational time could add up depending on the project.

Instead, with PostgreSQL, I can write a query to do this cleaning once and then store the results in a table. I wouldn’t need to spend time cleaning and transforming data again and again with a Python script, I could just set up the cleaning process in my database and call it a day! Once I started to make cleaning changes directly within my database, I was able to skip performing cleaning tasks within Python and simply focus on jumping straight into modeling my data.

To keep this post as succinct as possible, I chose to only show side-by-side code comparisons for Python and PostgreSQL. If you have any questions about other tools or languages, please feel free to join our Slack channel, where you can ask the Timescale community, or me, specific questions about Timescale or PostgreSQL functionality 😊. I’d love to hear from you!

Additionally, as we explore TimescaleDB and PostgreSQL functionality together, you may be eager to try things out right away! Which is awesome! The easiest way to get started is by signing up for a free 30-day trial of Timescale Cloud (if you prefer self-hosting, you can always install and manage TimescaleDB on your own PostgreSQL instances). Learn more by following one of our many tutorials.

Now, before we dip into things and get our data, as Outkast best put it, “So fresh, So clean”, I want to quickly cover the data set I will be using. In addition, I also want to note that all the code I show will assume you have some basic knowledge of SQL. If you are not familiar with SQL, don’t worry! In my last post, I included a section on SQL basics which you can find here.

About the sample dataset

In my experience within the data science realm, I have done the majority of my data cleaning after evaluation. However, sometimes it can be beneficial to clean data, evaluate, and then clean again. The process you choose is dependent on the initial state of your data and how easy it is to evaluate. For the data set I will use today, I would likely do some initial cleaning before evaluation and then clean again after, and I will show you why.

I got the following IoT data set from Kaggle, where a very generous individual shared the energy consumption readings from their apartment in San Jose, CA, with the data incrementing every 15 minutes. While this is awesome data, it is structured a little differently than I would like. The raw data set follows this schema:

Schema of the raw table energy_usage_staging: type (text), date (date), start_time (time), end_time (time), usage (float4), units (text), cost (text), notes (text)

and appears like this…

 type           |    date    | start_time | end_time | usage | units | cost  | notes
----------------+------------+------------+----------+-------+-------+-------+-------
 Electric usage | 2016-10-22 | 00:00:00   | 00:14:00 | 0.01  | kWh   | $0.00 |
 Electric usage | 2016-10-22 | 00:15:00   | 00:29:00 | 0.01  | kWh   | $0.00 |
 Electric usage | 2016-10-22 | 00:30:00   | 00:44:00 | 0.01  | kWh   | $0.00 |
 Electric usage | 2016-10-22 | 00:45:00   | 00:59:00 | 0.01  | kWh   | $0.00 |
 Electric usage | 2016-10-22 | 01:00:00   | 01:14:00 | 0.01  | kWh   | $0.00 |
 Electric usage | 2016-10-22 | 01:15:00   | 01:29:00 | 0.01  | kWh   | $0.00 |
 Electric usage | 2016-10-22 | 01:30:00   | 01:44:00 | 0.01  | kWh   | $0.00 |
 Electric usage | 2016-10-22 | 01:45:00   | 01:59:00 | 0.01  | kWh   | $0.00 |


In order to do any type of analysis on this data set, I want to clean it up. A few things that quickly come to mind include:

  • The cost is seen as a text data type which will cause some issues.
  • The time columns are split apart which could cause some problems if I want to create plots over time or perform any type of modeling based on time.
  • I may also want to filter the data based on various parameters that have to do with time, such as day of the week or holiday identification (both potentially play into how energy is used within the household).

In order to fix all of these things and get more valuable data evaluation and analysis, I will have to clean the incoming data! So without further ado, let’s roll up our sleeves and dig in!

Cleaning the data

I will show most of the techniques I have used in the past while working in data science. While these examples are not exhaustive, I hope they will cover many of the cleaning steps you perform during your own analysis, helping to make your cleaning tasks more efficient by using PostgreSQL and TimescaleDB.

Please feel free to explore these various techniques and skip around if you need! There is a lot here, and I designed it to be a helpful glossary of tools that you could use as you need.

The techniques that I will cover include:

Note on cleaning approach:

There are many ways that I could approach the cleaning process in PostgreSQL. I could create a table then ALTER it as I clean, I could create multiple tables as I add or change data, or I could work with VIEWs. Depending on the size of my data, any of these approaches could make sense, however, they will have different computational consequences.

You may have noticed above that my raw data table was called energy_usage_staging. This is because I decided that given the state of my raw data, it would be best for me to place the raw data in a staging table, clean it using VIEWs, then insert it into a more usable table as part of my cleaning process. This move from raw table to usable table could happen even before the evaluation step of analysis. As I discussed above, sometimes data cleaning has to occur after AND before evaluating your data. Regardless, this data needs to be cleaned and I wanted to use the most efficient method possible. In this case, that meant using a staging table and leveraging the efficiency and power of PostgreSQL VIEWs, something I will talk about later.

Generally, if you are dealing with a lot of data, altering an existing table in PostgreSQL can be costly. For this post I will show you how to build up clean data using VIEWs along with additional tables. This method of cleaning is more efficient and sets you up for the next blog post about data transformation which includes the use of scripts in PostgreSQL.

Correcting structural issues

Right off the bat, I know that I need to do some data refactoring on my raw table due to data types. Notice that we have date and time columns separated and costs is recorded as a text data type. I need to convert my separated date time columns to a timestamp and the cost column to float4. But before I show that, I want to talk about why conversion to timestamp is beneficial.

TimescaleDB hypertables and why timestamp is important

For those of you not familiar with the structure of TimescaleDB hypertables, they are at the basis of how we efficiently query and manipulate time-series data. Timescale hypertables are partitioned based on time, and more specifically by the time column you specify upon creation of the table.

The data is partitioned by timestamp into "chunks" so that every row in the table belongs to some chunk based on a time range. Queries constrained by time then only need to touch the relevant chunks, which makes time-based querying and data manipulation more efficient. This image represents the difference between a normal table and our special hypertables.

Graphic showing a normal table vs a hypertable. The normal table just shows data in a table. The hypertable shows data in the table, but it also shows the data being "grouped" or "chunked" by day. By adding an index like structure based on time, queries can be more efficient.
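As a rough sketch of what this looks like in practice (the chunk interval below is a hypothetical choice, and the actual hypertable for this data set is created later in the post), turning a table into a hypertable and inspecting its chunks is just a couple of function calls:

--hypothetical sketch: partition a table on its time column into 1-day chunks
SELECT create_hypertable('energy_usage', 'time', chunk_time_interval => INTERVAL '1 day');

--list the chunks TimescaleDB created for the table
SELECT show_chunks('energy_usage');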

Changing date-time structure

Because I want to utilize TimescaleDB functionality to the fullest, such as continuous aggregates and faster time based queries, I want to restructure the energy_usage_staging table's date and time columns. I could use the date column for my hypertable partitioning, however, I would have limited control over manipulating my data based on time. It is more flexible and space efficient to have a single column with a timestamp than it is to have separate columns with date and time. I can always extract the date or time from the timestamp if I want to later!  
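As a small illustration of that last point, here is a quick sketch (against the existing staging columns) showing that the date and the time-of-day can always be recovered from a combined timestamp with a simple cast:

--sketch: a combined timestamp still lets me pull the date or time back out
SELECT (date + start_time) AS combined_time,
       (date + start_time)::date AS just_date,
       (date + start_time)::time AS just_time
FROM energy_usage_staging
LIMIT 5;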

Looking back at the table structure, I should be able to get a usable timestamp value from the date and start_time columns, as the end_time really doesn’t give me that much useful information. Thus, I essentially want to combine these two columns to form a new timestamp column. Let’s see how I can do that using SQL. Spoiler alert: it is as simple as an algebraic statement. How cool is that?!

PostgreSQL code:

In PostgreSQL I can create the column without inserting it into the database just yet. Since I want to create a NEW table from this staging one, I don’t want to add more columns or tables just yet.

Let’s first compare the original columns with our new generated column. For this query I simply add the two columns together. The AS keyword just allows me to rename the column to whatever I would like, in this case being time.

--add the date column to the start_time column
SELECT date, start_time, (date + start_time) AS time 
FROM energy_usage_staging eus;

Results:

date       | start_time | time
2016-10-22 | 00:00:00   | 2016-10-22 00:00:00.000
2016-10-22 | 00:15:00   | 2016-10-22 00:15:00.000
2016-10-22 | 00:30:00   | 2016-10-22 00:30:00.000
2016-10-22 | 00:45:00   | 2016-10-22 00:45:00.000
2016-10-22 | 01:00:00   | 2016-10-22 01:00:00.000
2016-10-22 | 01:15:00   | 2016-10-22 01:15:00.000

Python code:

In Python, the easiest way to do this is to add a new column to the dataframe. Notice that in Python I would have to concatenate the two columns along with a defined space, then convert that column to datetime.

energy_stage_df['time'] = pd.to_datetime(energy_stage_df['date'] + ' ' + energy_stage_df['start_time'])
print(energy_stage_df[['date', 'start_time', 'time']])

Changing column data types

Next, I want to change the data type of my cost column from text to float. Again, this is straightforward in PostgreSQL with the TO_NUMBER() function.

The format of the function is as follows: TO_NUMBER(‘text’, ‘format’). The ‘format’ input is a PostgreSQL-specific string that you can build depending on the type of text you want to convert. In our case, we have a $ symbol followed by a numeric value in the form 0.00. For the format string I decided to use ‘L99D99’. The L lets PostgreSQL know there is a money symbol at the beginning of the text, the 9s let the system know I have numeric digits, and the D stands for a decimal point.

I decided to cap the conversion at values less than or equal to ‘$99.99’ because the cost column has no values greater than 0.65. If you were planning to convert a column with larger numeric values, you would want to account for that by adding a G for the thousands separators. For example, say you have a cost column with text values like ‘$1,672,278.23’; then you would want to format the string like this: ‘L9G999G999D99’.
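As a quick, hypothetical sanity check of that larger format string (the dollar amount below is made up and not part of this data set):

--hypothetical check: the G group separators let TO_NUMBER handle comma-formatted amounts
SELECT TO_NUMBER('$1,672,278.23', 'L9G999G999D99');
--expected output: 1672278.23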

PostgreSQL code:

--create a new column called cost_new with the to_number() function
SELECT cost, TO_NUMBER("cost", 'L9G999D99') AS cost_new
FROM energy_usage_staging eus  
ORDER BY cost_new DESC

Results:

cost  | cost_new
$0.65 | 0.65
$0.65 | 0.65
$0.65 | 0.65
$0.57 | 0.57
$0.46 | 0.46
$0.46 | 0.46
$0.46 | 0.46
$0.46 | 0.46

Python code:

For Python, I used a lambda function which systematically replaces all the ‘$’ signs with empty strings. This can be fairly inefficient.

energy_stage_df['cost_new'] = pd.to_numeric(energy_stage_df.cost.apply(lambda x: x.replace('$','')))
print(energy_stage_df[['cost', 'cost_new']])

Creating a VIEW

Now that I know how to convert my columns, I can combine the two queries and create a VIEW of my new restructured table. A VIEW is a PostgreSQL object that allows you to define a query and then refer to it by the VIEW's name, as if it were a table within your database. I can use the following query to generate the data I want, then create a VIEW so I can query the result as if it were a table.

PostgreSQL code:

-- query the right data that I want
SELECT type, 
(date + start_time) AS time, 
"usage", 
units, 
TO_NUMBER("cost", 'L9G999D99') AS cost, 
notes 
FROM energy_usage_staging

Results:

type           | time                    | usage | units | cost | notes
Electric usage | 2016-10-22 00:00:00.000 | 0.01  | kWh   | 0.00 |
Electric usage | 2016-10-22 00:15:00.000 | 0.01  | kWh   | 0.00 |
Electric usage | 2016-10-22 00:30:00.000 | 0.01  | kWh   | 0.00 |
Electric usage | 2016-10-22 00:45:00.000 | 0.01  | kWh   | 0.00 |
Electric usage | 2016-10-22 01:00:00.000 | 0.01  | kWh   | 0.00 |
Electric usage | 2016-10-22 01:15:00.000 | 0.01  | kWh   | 0.00 |
Electric usage | 2016-10-22 01:30:00.000 | 0.01  | kWh   | 0.00 |
Electric usage | 2016-10-22 01:45:00.000 | 0.01  | kWh   | 0.00 |
Electric usage | 2016-10-22 02:00:00.000 | 0.02  | kWh   | 0.00 |
Electric usage | 2016-10-22 02:15:00.000 | 0.02  | kWh   | 0.00 |

I decided to call my VIEW energy_view. Now, when I want to do further cleaning, I can just specify its name in the FROM statement.

--create view from the query above
CREATE VIEW energy_view AS
SELECT type, 
(date + start_time) AS time, 
"usage", 
units, 
TO_NUMBER("cost", 'L9G999D99') AS cost, 
notes 
FROM energy_usage_staging

Python code:

energy_df = energy_stage_df[['type','time','usage','units','cost_new','notes']]
energy_df.rename(columns={'cost_new':'cost'}, inplace = True)
print(energy_df.head(20))

It is important to note that with PostgreSQL VIEWs, the data inside them has to be recalculated every time you query them. This is why we want to insert our VIEW data into a hypertable once we have the data set up just right. You can think of VIEWs as a shorthand version of the CTE WITH ... AS syntax I discussed in my last post.
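To make that comparison concrete, here is a small sketch of the same cleaning logic written as a one-off CTE versus reusing the VIEW defined above:

--the cleaning query as a CTE, restated in every query that needs it...
WITH energy_clean AS (
    SELECT type, (date + start_time) AS time, "usage", units,
           TO_NUMBER("cost", 'L9G999D99') AS cost, notes
    FROM energy_usage_staging
)
SELECT * FROM energy_clean LIMIT 5;

--...versus the VIEW, which packages the same query under a reusable name
SELECT * FROM energy_view LIMIT 5;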

We are now one step closer to cleaner data!

Creating or generating relevant data

With some quick investigation, we can see that the notes column is blank for this data set. To check this I just need to include a WHERE clause and specify where notes are not equal to an empty string.

PostgreSQL code:

SELECT * 
FROM energy_view ew
-- where notes are not equal to an empty string
WHERE notes!='';

Results come out empty

Python code:

print(energy_df[energy_df['notes'].notnull()])

Since the notes are blank, I would like to replace the column with various sets of additional information that I could use later on during modeling. One thing I would like to add in particular is a column that specifies the day of the week. To do this I can use the EXTRACT() command. The EXTRACT() command is a PostgreSQL date/time function that allows you to extract various date/time elements. For our column, PostgreSQL has the specification DOW (day-of-week), which maps Sunday to 0 through Saturday to 6.

PostgreSQL code:

--extract day-of-week from date column and cast the output to an int
SELECT *,
EXTRACT(DOW FROM time)::int AS day_of_week
FROM energy_view ew

Results:

type           | time                    | usage | units | cost | notes | day_of_week
Electric usage | 2016-10-22 00:00:00.000 | 0.01  | kWh   | 0.00 |       | 6
Electric usage | 2016-10-22 00:15:00.000 | 0.01  | kWh   | 0.00 |       | 6
Electric usage | 2016-10-22 00:30:00.000 | 0.01  | kWh   | 0.00 |       | 6
Electric usage | 2016-10-22 00:45:00.000 | 0.01  | kWh   | 0.00 |       | 6
Electric usage | 2016-10-22 01:00:00.000 | 0.01  | kWh   | 0.00 |       | 6
Electric usage | 2016-10-22 01:15:00.000 | 0.01  | kWh   | 0.00 |       | 6

Python code:

energy_df['day_of_week'] = energy_df['time'].dt.dayofweek

Additionally, we may want to add another column that specifies whether a day falls on a weekend or a weekday. I will do this by creating a boolean column, where true represents a weekend and false represents a weekday. To do this, I will apply a CASE statement. With this command I can specify “when-then” statements (similar to “if-then” statements in coding), where I can say WHEN a day_of_week value is IN the set (0,6) THEN the output should be true, ELSE the value should be false.

PostgreSQL code:

SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
--use the case statement to make a column true when records fall on a weekend aka 0 and 6
CASE WHEN (EXTRACT(DOW FROM time)::int) IN (0,6) then true
	ELSE false
END AS is_weekend
FROM energy_view ew

Results:

type           | time                    | usage | units | cost | day_of_week | is_weekend
Electric usage | 2016-10-22 00:00:00.000 | 0.01  | kWh   | 0.00 | 6           | true
Electric usage | 2016-10-22 00:15:00.000 | 0.01  | kWh   | 0.00 | 6           | true
Electric usage | 2016-10-22 00:30:00.000 | 0.01  | kWh   | 0.00 | 6           | true
Electric usage | 2016-10-22 00:45:00.000 | 0.01  | kWh   | 0.00 | 6           | true
Electric usage | 2016-10-22 01:00:00.000 | 0.01  | kWh   | 0.00 | 6           | true

Fun fact: you can do the same query without a CASE statement; however, it only works for binary (boolean) columns.

--another method to create a binary column
SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
EXTRACT(DOW FROM time)::int IN (0,6) AS is_weekend
FROM energy_view ew

Python code:

Notice that in Python, the weekends are represented by numbers 5 and 6 vs the PostgreSQL weekend values 0 and 6.

energy_df['is_weekend'] = np.where(energy_df['day_of_week'].isin([5,6]), 1, 0)
print(energy_df.head(20))

And maybe things then start getting real crazy, maybe you want to add more parameters!

Let’s consider holidays. Now you may be asking “Why in the world would we do that?!”, but often people have time off during some of the holidays within the US. Since this individual lives within the US, they likely have at least some of the holidays off whether they are the day of OR a federal holiday. Where there are days off, there could be a difference in energy usage. To help guide my analysis, I want to include the identification of holidays. To do this, I’m going to create another boolean column that identifies when a federal holiday occurs.

To do this, I am going to use TimescaleDB’s time_bucket() function. The time_bucket() function is one of the functions I discussed in detail within my previous post. Essentially, I need to use this function to make sure all time values within a single day get accounted for. Without using the time_bucket() function, I would only see changes to the row associated with the 12am time period.

PostgreSQL code:

After I create a holiday table, I can then use the data from it within my query. I also decided to use the non-case syntax for this query. Note that you can use either!

--create table for the holidays
CREATE TABLE holidays (
date date)

--insert the holidays into table
INSERT INTO holidays 
VALUES ('2016-11-11'), 
('2016-11-24'), 
('2016-12-24'), 
('2016-12-25'), 
('2016-12-26'), 
('2017-01-01'),  
('2017-01-02'), 
('2017-01-16'), 
('2017-02-20'), 
('2017-05-29'), 
('2017-07-04'), 
('2017-09-04'), 
('2017-10-9'), 
('2017-11-10'), 
('2017-11-23'), 
('2017-11-24'), 
('2017-12-24'), 
('2017-12-25'), 
('2018-01-01'), 
('2018-01-15'), 
('2018-02-19'), 
('2018-05-28'), 
('2018-07-4'), 
('2018-09-03'), 
('2018-10-8')

SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
EXTRACT(DOW FROM time)::int IN (0,6) AS is_weekend,
-- I can then select the data from the holidays table directly within my IN statement
time_bucket('1 day', time) IN (SELECT date FROM holidays) AS is_holiday
FROM energy_view ew

Results:

type           | time                    | usage | units | cost | day_of_week | is_weekend | is_holiday
Electric usage | 2016-10-22 00:00:00.000 | 0.01  | kWh   | 0.00 | 6           | true       | false
Electric usage | 2016-10-22 00:15:00.000 | 0.01  | kWh   | 0.00 | 6           | true       | false
Electric usage | 2016-10-22 00:30:00.000 | 0.01  | kWh   | 0.00 | 6           | true       | false
Electric usage | 2016-10-22 00:45:00.000 | 0.01  | kWh   | 0.00 | 6           | true       | false
Electric usage | 2016-10-22 01:00:00.000 | 0.01  | kWh   | 0.00 | 6           | true       | false
Electric usage | 2016-10-22 01:15:00.000 | 0.01  | kWh   | 0.00 | 6           | true       | false

Python code:

holidays = ['2016-11-11', '2016-11-24', '2016-12-24', '2016-12-25', '2016-12-26', '2017-01-01',  '2017-01-02', '2017-01-16', '2017-02-20', '2017-05-29', '2017-07-04', '2017-09-04', '2017-10-9', '2017-11-10', '2017-11-23', '2017-11-24', '2017-12-24', '2017-12-25', '2018-01-01', '2018-01-15', '2018-02-19', '2018-05-28', '2018-07-4', '2018-09-03', '2018-10-8']
# compare the date part of the timestamp (not day_of_week) against the holiday list
energy_df['is_holiday'] = np.where(energy_df['time'].dt.normalize().isin(pd.to_datetime(holidays)), 1, 0)
print(energy_df.head(20))

At this point, I’m going to save this expanded table into another VIEW so that I can call the data without writing out the query.

PostgreSQL code:

--create another view with the data from our first round of cleaning
CREATE VIEW energy_view_exp AS
SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
EXTRACT(DOW FROM time)::int IN (0,6) AS is_weekend,
time_bucket('1 day', time) IN (select date from holidays) AS is_holiday
FROM energy_view ew

You may be asking, “Why did you create these as boolean columns??”, a very fair question! You see, I may want to use these columns for filtering during analysis, something I commonly do during my own analysis process. In PostgreSQL, when you use boolean columns you can filter things super easily. For example, say that I want to use my table query so far and show only the data that occurs over the weekend AND a holiday. I can do this simply by adding in a WHERE statement along with the specified columns.

PostgreSQL code:

--if you use binary columns, then you can filter with a simple WHERE statement
SELECT *
FROM energy_view_exp
WHERE is_weekend AND is_holiday

Results:

type           | time                    | usage | units | cost | day_of_week | is_weekend | is_holiday
Electric usage | 2016-12-24 00:00:00.000 | 0.34  | kWh   | 0.06 | 6           | true       | true
Electric usage | 2016-12-24 00:15:00.000 | 0.34  | kWh   | 0.06 | 6           | true       | true
Electric usage | 2016-12-24 00:30:00.000 | 0.34  | kWh   | 0.06 | 6           | true       | true
Electric usage | 2016-12-24 00:45:00.000 | 0.34  | kWh   | 0.06 | 6           | true       | true
Electric usage | 2016-12-24 01:00:00.000 | 0.34  | kWh   | 0.06 | 6           | true       | true
Electric usage | 2016-12-24 01:15:00.000 | 0.34  | kWh   | 0.06 | 6           | true       | true

Python code:

print(energy_df[(energy_df['is_weekend']==1) & (energy_df['is_holiday']==1)].head(10))

Adding data to a hypertable

Now that I have new columns ready to go and I know how I would like my table to be structured, I can create a new hypertable and insert my cleaned data. In my own analysis with this data set, I may have done the cleaning up to this point BEFORE evaluating my data so that I can get a more meaningful evaluation step in analysis. What’s great is that you can use any of these techniques for general cleaning, whether that is before or after evaluation.

PostgreSQL:

CREATE TABLE energy_usage (
type text,
time timestamptz,
usage float4,
units text,
cost float4,
day_of_week int,
is_weekend bool,
is_holiday bool
);

--command to create a hypertable partitioned on the time column
SELECT create_hypertable('energy_usage', 'time');

--move the cleaned data from the VIEW into the hypertable
INSERT INTO energy_usage
SELECT *
FROM energy_view_exp;

Note that if you had data continually coming in you could create a script within your database that automatically makes these changes when importing your data. That way you can have cleaned data ready to go in your database rather than processing and cleaning the data in your scripts every time you want to perform analysis.
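As a minimal sketch of that idea (the procedure name load_energy_usage is hypothetical, and this naive version simply reloads everything exposed by the VIEW), the clean-and-load step could be wrapped up so it can be run in one call whenever new raw data lands in the staging table:

--hypothetical helper: move cleaned rows from the VIEW into the hypertable
CREATE OR REPLACE PROCEDURE load_energy_usage()
LANGUAGE sql
AS $$
    INSERT INTO energy_usage
    SELECT * FROM energy_view_exp;
$$;

CALL load_energy_usage();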

We will discuss this in detail in my next post, so make sure to stay tuned in if you want to know how to create scripts and keep data automatically updated!

Renaming values

Another valuable technique for cleaning data is being able to rename various items or remap categorical values. The importance of this skill is amplified by the popularity of this Python data analysis question on StackOverflow. The question states “How do I change a single index value in a pandas dataframe?”. Since PostgreSQL and TimescaleDB use relational table structures, renaming unique values can be fairly simple.

When renaming specific index values within a table, you can do this “on the fly” by using PostgreSQL’s CASE statement within the SELECT query. Let’s say I don’t like Sunday being represented by a 0 in the day_of_week column, but would prefer it to be a 7. I can do this with the following query.

PostgreSQL code:

SELECT type, time, usage, cost, is_weekend,
-- you can use CASE to recode column values
CASE WHEN day_of_week = 0 THEN 7
ELSE day_of_week
END AS day_of_week
FROM energy_usage

Python code:

Caveat: this code would make Monday = 7, because the Python day-of-week function has Monday set to 0 and Sunday set to 6. But this is how you would update one value within a column. You likely would not want to perform this exact change; I just wanted to show the Python equivalent for reference.

energy_df.loc[energy_df['day_of_week'] == 0, 'day_of_week'] = 7
print(energy_df.head(250))

Now, what if I wanted to use the names of the days of the week instead of numeric values? For this example, I want to ditch the CASE statement and create a mapping table. When you need to remap many values, it will likely be more efficient to create a mapping table and then join to it using the JOIN command.

PostgreSQL:

--first I need to create the table
CREATE TABLE day_of_week_mapping (
day_of_week_int int,
day_of_week_name text
)

--then I want to add data to my table
INSERT INTO day_of_week_mapping
VALUES (0, 'Sunday'),
(1, 'Monday'),
(2, 'Tuesday'),
(3, 'Wednesday'),
(4, 'Thursday'),
(5, 'Friday'),
(6, 'Saturday')

--then I can join this table to my cleaning table to remap the days of the week
SELECT type, time, usage, units, cost, dowm.day_of_week_name, is_weekend
FROM energy_usage eu
LEFT JOIN day_of_week_mapping dowm ON dowm.day_of_week_int = eu.day_of_week

Results:

type           | time                    | usage | units | cost | day_of_week_name | is_weekend
Electric usage | 2018-07-22 00:45:00.000 | 0.1   | kWh   | 0.03 | Sunday           | true
Electric usage | 2018-07-22 00:30:00.000 | 0.1   | kWh   | 0.03 | Sunday           | true
Electric usage | 2018-07-22 00:15:00.000 | 0.1   | kWh   | 0.03 | Sunday           | true
Electric usage | 2018-07-22 00:00:00.000 | 0.1   | kWh   | 0.03 | Sunday           | true
Electric usage | 2018-02-11 23:00:00.000 | 0.04  | kWh   | 0.01 | Sunday           | true

Python:

In this case, Python has a similar mapping function.

energy_df['day_of_week_name'] = energy_df['day_of_week'].map({0 : 'Sunday', 1 : 'Monday', 2: 'Tuesday', 3: 'Wednesday', 4: 'Thursday', 5: 'Friday', 6: 'Saturday'})
print(energy_df.head(20))

Hopefully, one of these techniques will be useful for you as you approach data renaming!

Additionally, remember that if you would like to change the name of a column in your table, it is truly as easy as AS (I couldn’t not use such a ridiculous statement 😂). When you use the SELECT statement, you can rename your columns like so:

PostgreSQL code:

SELECT type AS usage_type,
time as time_stamp,
usage,
units, 
cost AS dollar_amount
FROM energy_view_exp
LIMIT 20;

Results:

usage_type     | time_stamp              | usage | units | dollar_amount
Electric usage | 2016-10-22 00:00:00.000 | 0.01  | kWh   | 0.00
Electric usage | 2016-10-22 00:15:00.000 | 0.01  | kWh   | 0.00
Electric usage | 2016-10-22 00:30:00.000 | 0.01  | kWh   | 0.00
Electric usage | 2016-10-22 00:45:00.000 | 0.01  | kWh   | 0.00

Python code:

Comparatively, renaming columns in Python can be a huge pain. This is an area where SQL is not only faster, but also just more elegant in its code.

energy_df.rename(columns={'type':'usage_type', 'time':'time_stamp', 'cost':'dollar_amount'}, inplace=True)
print(energy_df[['usage_type','time_stamp','usage','units','dollar_amount']].head(20))

Fill in missing data

Another common problem in the data cleaning process is missing data. For the dataset we are using, there are no obviously missing data points; however, it is very possible that, upon evaluation, we would find missing hourly data from a power outage or some other phenomenon. This is where the gap-filling functions TimescaleDB offers come in handy. When using algorithms, missing data can often have significant negative impacts on the accuracy or dependability of the model. Sometimes you can navigate this problem by filling in missing data with reasonable estimates, and TimescaleDB has built-in functions to help you do this.

For example, let’s say that you are modeling the energy usage over individual days of the week and a handful of days have missing energy data due to a power outage or an issue with the sensor. We could remove the data, or try to fill in the missing values with reasonable estimations. For today, let’s assume that the model I want to use would benefit more from filling in the missing values.

As an example, I created some data. I called this table energy_data and it is missing both time and energy readings for the timestamps between 7:45am and 11:30am.

time                    | energy
2021-01-01 07:00:00.000 | 0
2021-01-01 07:15:00.000 | 0.1
2021-01-01 07:30:00.000 | 0.1
2021-01-01 07:45:00.000 | 0.2
2021-01-01 11:30:00.000 | 0.04
2021-01-01 11:45:00.000 | 0.04
2021-01-01 12:00:00.000 | 0.03
2021-01-01 12:15:00.000 | 0.02
2021-01-01 12:30:00.000 | 0.03
2021-01-01 12:45:00.000 | 0.02
2021-01-01 13:00:00.000 | 0.03

I can use TimescaleDB’s gapfilling hyperfunctions to fill in these missing values. The interpolate() function is another one of TimescaleDB’s hyperfunctions; it creates data points that follow a linear approximation given the data points before and after the missing range. Alternatively, you could use the locf() hyperfunction, which carries the last recorded value forward to fill in the gap (locf stands for last observation carried forward). Both of these functions must be used in conjunction with the time_bucket_gapfill() function.

PostgreSQL code:

SELECT
--here I specified that the data should increment by 15 mins
  time_bucket_gapfill('15 min', time) AS timestamp,
  interpolate(avg(energy)),
  locf(avg(energy))
FROM energy_data
--to use gapfill, you will have to take out any time data associated with null values. You can do this using the IS NOT NULL statement
WHERE energy IS NOT NULL AND time > '2021-01-01 07:00:00.000' AND time < '2021-01-01 13:00:00.000'
GROUP BY timestamp
ORDER BY timestamp;

Results:

timestamp               | interpolate | locf
2021-01-01 07:00:00.000 | 0.1         | 0.10000000000000000000
2021-01-01 07:30:00.000 | 0.15        | 0.15000000000000000000
2021-01-01 08:00:00.000 | 0.13625     | 0.15000000000000000000
2021-01-01 08:30:00.000 | 0.1225      | 0.15000000000000000000
2021-01-01 09:00:00.000 | 0.10875     | 0.15000000000000000000
2021-01-01 09:30:00.000 | 0.095       | 0.15000000000000000000
2021-01-01 10:00:00.000 | 0.08125     | 0.15000000000000000000
2021-01-01 10:30:00.000 | 0.0675      | 0.15000000000000000000
2021-01-01 11:00:00.000 | 0.05375     | 0.15000000000000000000
2021-01-01 11:30:00.000 | 0.04        | 0.04000000000000000000
2021-01-01 12:00:00.000 | 0.025       | 0.02500000000000000000
2021-01-01 12:30:00.000 | 0.025       | 0.02500000000000000000

Python code:

energy_test_df['time'] = pd.to_datetime(energy_test_df['time'])
energy_test_df_locf = energy_test_df.set_index('time').resample('15 min').fillna(method='ffill').reset_index()
energy_test_df = energy_test_df.set_index('time').resample('15 min').interpolate().reset_index()
energy_test_df['locf'] = energy_test_df_locf['energy']
print(energy_test_df)

Bonus:

The following query is how I could simply ignore the missing data. I wanted to include this to show you just how easy it can be to exclude null data. Alternatively, I could use a WHERE clause to specify the times which I would like to ignore (the second query).

SELECT * 
FROM energy_data 
WHERE energy IS NOT NULL

SELECT * 
FROM energy_data
WHERE time <= '2021-01-01 07:45:00.000' OR time >= '2021-01-01 11:30:00.000'

Wrap Up

After reading through these various cleaning techniques, I hope you feel more comfortable with exploring some of the possibilities that PostgreSQL and TimescaleDB provide. By cleaning data directly within my database, I am able to perform a lot of my cleaning tasks a single time rather than repetitively within a script, thus saving me time in the long run. If you are looking to save time and effort while cleaning your data for analysis, definitely consider using PostgreSQL and TimescaleDB.

In my next posts, I will go over techniques on how to transform data using PostgreSQL and TimescaleDB. I'll then take everything we've learned together to benchmark data munging tasks in PostgreSQL and TimescaleDB vs. Python and pandas. The final blog post will walk you through the full process on a real dataset by conducting a deep-dive into data analysis with TimescaleDB (for data munging) and Python (for modeling and visualizations).

If you have questions about TimescaleDB, time-series data, or any of the functionality mentioned above, join our community Slack, where you'll find an active community of time-series enthusiasts and various Timescale team members (including me!).

If you’re ready to see the power of TimescaleDB and PostgreSQL right away, you can sign up for a free 30-day trial or install TimescaleDB and manage it on your current PostgreSQL instances. We also have a bunch of great tutorials to help get you started.

Until next time!

Functionality Glossary:


Ryan Lambert: Permissions required for PostGIS


PostGIS is a widely popular spatial database extension for Postgres. It's also one of my favorite tools! A recent discussion on the People, Postgres, Data Discord server highlighted that the permissions required for various PostGIS operations were not clearly explained in the PostGIS documentation. As it turned out, I didn't know exactly what was required either. The basic PostGIS install page provides resources for installing the binary on the server and the basic CREATE EXTENSION commands, but does not explain the permissions required.

This post explores the permissions required for three types of PostGIS interactions:

  • Install/Create PostGIS
  • Use PostGIS
  • Load data from pg_dump
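As a rough, hypothetical sketch of those three interactions (the object names below are made up; which privileges each step actually needs is exactly what the rest of this post explores):

--install/create PostGIS, typically as a superuser or suitably privileged role
CREATE EXTENSION IF NOT EXISTS postgis;

--use PostGIS from a regular role
SELECT ST_AsText(ST_SetSRID(ST_MakePoint(-105.0, 40.0), 4326));

--load data produced by pg_dump, e.g. from the shell:
--  psql -d postgis_perms -f spatial_backup.sql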

Database and Users

I am using Postgres installed on my laptop for these tests, Postgres 13.5 and PostGIS 3.1. I created an empty database named postgis_perms and checked the \du slash command in psql to see the current roles. This instance has my ryanlambert role, a superuser, and the default postgres role. The postgres role is not used in this post outside of this example.

([local] 🐘) ryanlambert@postgis_perms=# \du
                                     List of roles
┌─────────────┬────────────────────────────────────────────────────────────┬───────────┐
│  Role name  │                         Attributes                         │ Member of │
╞═════════════╪════════════════════════════════════════════════════════════╪═══════════╡
│ postgres    │ Superuser, Create role, Create DB, Replication, Bypass RLS │ {}        │
│ ryanlambert │ Superuser, Create role, Create DB                          │ {}        │
└─────────────┴────────────────────────────────────────────────────────────┴───────────┘

Elizabeth Garrett Christensen: PostGIS Day 2021


Crunchy Data hosted the third annual PostGIS Day on November 18th. This was our second year with a virtual format and another year of record attendance! We had attendees from more than 99 countries.

Lukas Fittl: Understanding Postgres GIN Indexes: The Good and the Bad

Adding, tuning and removing indexes is an essential part of maintaining an application that uses a database. Oftentimes, our applications rely on sophisticated database features and data types, such as JSONB, array types or full text search in Postgres. A simple B-tree index does not work in such situations, for example to index a JSONB column. Instead, we need to look beyond, to GIN indexes. Almost 15 years ago to the dot, GIN indexes were added in Postgres 8.2, and they have since become an…
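For a flavor of what that looks like (a hypothetical table, not one from the article), a GIN index on a JSONB column supports containment queries that a plain B-tree cannot:

--hypothetical example: index a JSONB column with GIN
CREATE TABLE docs (id bigint PRIMARY KEY, payload jsonb);
CREATE INDEX docs_payload_gin ON docs USING GIN (payload);

--containment queries like this one can then use the GIN index
SELECT id FROM docs WHERE payload @> '{"status": "active"}';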

Amit Kapila: PostgreSQL 14 and beyond


I would like to talk about the key features in PostgreSQL 14, and what is being discussed in the community for PostgreSQL 15 and beyond.

Jimmy Angelakos: Slow things down to make them go faster [Postgres Build 2021]


It's easy to get misled into overconfidence based on the performance of powerful servers, given today's monster core counts and RAM sizes.
However, the reality of high concurrency usage is often disappointing, with less throughput than one would expect.
Because of its internals and its multi-process architecture, PostgreSQL is very particular about how it likes to deal with high concurrency and in some cases it can slow down to the point where it looks like it's not performing as it should.
In this talk we'll take a look at potential pitfalls when you throw a lot of work at your database. Specifically, very high concurrency and resource contention can cause problems with lock waits in Postgres. Very high transaction rates can also cause problems of a different nature.
Finally, we will be looking at ways to mitigate these by examining our queries and connection parameters, leveraging connection pooling and replication, or adapting the workload.

Video from my talk at this year's Postgres Build 👇

Accidental wisdom: "You can't avoid Postgres" -Jimmy

You can find the slides from the talk here.

Frits Hoogland: podman machine on mac OSX 12.0.1 (Monterey)


Podman is a drop-in replacement for Docker that can run containers daemonless and rootless ("ruthless"?). Containers are built on cgroups, namespaces and IPC, which exist in Linux, and therefore require a Linux system to support them; on macOS, podman provides this via the podman machine, a virtual machine based on Fedora CoreOS that runs in QEMU.

Setup

Much of the configuration depends on the existence of 'brew' on OSX. If you haven't got brew (homebrew) installed, you can do so using:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

In order to run the podman machine, the podman software needs to be installed (step 1), a virtual machine for running podman on linux needs to be created (step 2), and run (step 3).

1. Install podman

brew install podman

2. Initialize podman machine

podman machine init

3. Start podman machine

podman machine start

Verify podman machine

Because the podman machine must run before it can run containers, it is useful to understand if the podman machine is running. This is done with 'podman machine list':

Up:

podman machine list
NAME      VM TYPE     CREATED       LAST UP           CPUS    MEMORY     DISK SIZE
podman-m* qemu        17 hours ago  Currently running 1       2.147GB    10.74GB

Down:

podman machine list
NAME      VM TYPE     CREATED       LAST UP           CPUS    MEMORY      DISK SIZE
podman-m* qemu        17 hours ago  3 seconds ago     1       2.147GB     10.74GB

Containers & yugabyte

This setup is ideal for developers who want an easy way to set up YugabyteDB without all the hassle of configuration.

Any container work with podman requires the podman machine to be running; the podman machine is what actually performs the container commands.

For any type of coordinated work, it's important to pin the version of the software you are using. 'Latest' can point to a different version over time and cause version sprawl, so I would strongly recommend always choosing a specific version.

Obtain the yugabyte docker versions available:

curl -L -s 'https://registry.hub.docker.com/v2/repositories/yugabytedb/yugabyte/tags?page_size=5' | jq '."results"[]["name"]'
"2.6.7.0-b10"
"2.11.0.0-b7"
"2.4.8.0-b16"
"2.6.6.0-b10"
"2.8.0.0-b37"

Please mind the jq executable is not installed by default on OSX, but can easily be installed using brew:

brew install jq

From the above versions, choose one to use, and obtain the image of the selected version in the following way:

podman pull yugabytedb/yugabyte:2.11.0.0-b7
Resolving "yugabytedb/yugabyte" using unqualified-search registries (/etc/containers/registries.conf.d/999-podman-machine.conf)
Trying to pull docker.io/yugabytedb/yugabyte:2.11.0.0-b7...
Getting image source signatures
Copying blob sha256:486c41cfe6bf41372e1fbbe5e644b65e27a0d088135dbd3989721cb251147731
...snipped for brevity...
Copying blob sha256:ea30bbe39b88dfca4bdc2353505ea36c9322b8e9e17f969a0aedb1f058969f88
Copying config sha256:4f1f8156a955f434215a6f8ed01d782d61179c7624cc82a300c2f111c4fa7b51
Writing manifest to image destination
Storing signatures
4f1f8156a955f434215a6f8ed01d782d61179c7624cc82a300c2f111c4fa7b51

Now a container can be started from the downloaded image:

podman run -d --name yugabyte-2.11 -p5433:5433 -p7000:7000 -p9000:9000 yugabytedb/yugabyte:2.11.0.0-b7 bin/yugabyted start --base_dir=/home/yugabyte/yb_data --daemon=false
701422c063b46462c2b5bd573c117345f996e914325e26979829e506b8bc4362

This takes a few moments to start.
When it has been started, the container and its status can be validated using podman ps:

podman ps
CONTAINER ID  IMAGE                    COMMAND               CREATED         STATUS             PORTS                                                                   NAMES
701422c063b4  ../yugabyte:2.11.0.0-b7  bin/yugabyted sta...  37 seconds ago  Up 36 seconds ago  0.0.0.0:5433->5433/tcp, 0.0.0.0:7000->7000/tcp, 0.0.0.0:9000->9000/tcp  yugabyte-2.11

If the container was successfully started, it will say 'Up' with the status. Also mind the name, which is important if you have got more than one container running.

One issue I found was that port 7000 was already taken, which prevented the container from starting because it wanted to use port 7000 on localhost. This was caused by the macOS AirPlay Receiver (System Preferences > Sharing > AirPlay Receiver), which is checked by default and needs to be unchecked.

After the container has started, it can be accessed from the CLI in the following way:

podman exec -it yugabyte-2.11 bash 
[root@701422c063b4 yugabyte]#

This allows you to investigate logfiles, process statuses, etc.

Stop the yugabyte container:

podman stop yugabyte-2.11

Restart the yugabyte container:

podman restart yugabyte-2.11

Please be aware that the yugabyte container must be stopped prior to stopping the podman machine. The podman machine might need to be stopped if no containers need to run, and will be stopped if the Mac is turned off or restarted. If the yugabyte container is not stopped cleanly, it will leave a file in place indicating that YSQL is running, which will prevent YSQL from starting up when the container is started again.

podman, containers and host restart

During the setup above, the podman machine has been initialized and is ready for use. After a host reboot, the podman machine doesn't need to be initialized again. However, the podman machine must be started after a reboot; it isn't started automatically:

podman machine start

Once the podman machine is started, you can query the container statuses. By default containers are not automatically started on podman machine startup. To query the status of the containers including non-running containers, use the '--all' flag:

podman ps --all
CONTAINER ID  IMAGE                    COMMAND               CREATED      STATUS                     PORTS                                                                   NAMES
701422c063b4  ../yugabyte:2.11.0.0-b7  bin/yugabyted sta...  2 hours ago  Exited (0) 10 minutes ago  0.0.0.0:5433->5433/tcp, 0.0.0.0:7000->7000/tcp, 0.0.0.0:9000->9000/tcp  yugabyte-2.11

This shows that our yugabyte-2.11 container still is there, but it is not running. In order to use it, start the container:

podman start yugabyte-2.11
yugabyte-2.11

If we run podman ps again, we can validate the container is now running:

podman ps --all
CONTAINER ID  IMAGE                    COMMAND               CREATED      STATUS                     PORTS                                                                   NAMES
701422c063b4  ../yugabyte:2.11.0.0-b7  bin/yugabyted sta...  2 hours ago  Up 40 seconds ago  0.0.0.0:5433->5433/tcp, 0.0.0.0:7000->7000/tcp, 0.0.0.0:9000->9000/tcp  yugabyte-2.11

One way of using YSQL is to install PostgreSQL on the Mac via brew (brew install postgresql). You can then run psql on the CLI directly to access YSQL in the container via the forwarded port 5433.

The database and its contents do survive stopping and starting the container, including if this has happened as part of a restart of the host. If a container is removed, the data is removed with it.

Remove podman machine

The podman machine running in qemu can be stopped, and removed:

podman machine stop
podman machine rm

If the podman machine is removed, all the containers it hosted are removed with it.

The podman files are stored in the following places:
  • ~/.config: podman machine configuration file
  • ~/.local: podman machine disk image
  • ~/.ssh: podman machine private and public keys

Containers and their configuration are stored inside the podman machine.


Paul Ramsey: Tricks for Faster Spatial Indexes


One of the curious aspects of spatial indexes is that the nodes of the tree can overlap, because the objects being indexed themselves also overlap.

Regina Obe: PostGIS 3.2.0beta3 Released


The PostGIS Team is pleased to release the third beta of the upcoming PostGIS 3.2.0 release.

Best served with PostgreSQL 14. This version of PostGIS can utilize the faster GiST building support API introduced in PostgreSQL 14. If compiled with recently released GEOS 3.10.1 you can take advantage of improvements in ST_MakeValid and numerous speed improvements. This release also includes many additional functions and improvements for postgis, postgis_raster and postgis_topology extensions and a new input/export format FlatGeobuf.

Continue Reading by clicking title hyperlink ..

Andreas 'ads' Scherbaum: Emre Hasegeli

PostgreSQL Person of the Week Interview with Emre Hasegeli: I was born and grew up in İzmir, Turkey, studied in İstanbul, lived and worked in Germany and in the UK for a while, and moved back to my hometown this year. I am currently working remotely for End Point, a US based software consultancy company which develops Bucardo.

Luca Ferrari: pgdump, text and xz


A non-scientific look at how to compress a set of SQL dumps.

pgdump, text and xz

I have a database that contains around 50 GB of data. I do continuous backups through pgBackRest, and I also do regular pg_dump runs in directory format via multiple jobs, so I'm fine with backups.
However, why not have a look at plain SQL backups?
First of all: the content of the database is mostly numeric, being a quite large container of sensor data. This means the data should compress very well.
Moreover, tables are partitioned on a per-year and per-month basis, so I have a regular structure with one year table and twelve monthly children. For instance, for the current year there is a table named y2021 with partitions named y2021m01 through y2021m12.
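A hypothetical sketch of that layout (the column names below are made up; only the respi schema and the yYYYY/yYYYYmMM naming come from the real database):

--one parent table per year, range-partitioned on the timestamp
CREATE TABLE respi.y2021 (
    recorded_at timestamptz NOT NULL,
    sensor_id   int,
    reading     numeric
) PARTITION BY RANGE (recorded_at);

--twelve monthly children, one shown here
CREATE TABLE respi.y2021m01 PARTITION OF respi.y2021
    FOR VALUES FROM ('2021-01-01') TO ('2021-02-01');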

pg_dump in text mode

I did a simple for loop in my shell to produce a few backup files, separating every single file by its year:



% for y in $(echo 2018 2019 2020 2021 2022); do
    echo "Backup year $y"
    time pg_dump -h miguel -U postgres -f sensorsdb.$y.sql -t "respi.y${y}*" sensorsdb
done



This produces the following amount of data:



% ls -sh1 *.sql
3,5G sensorsdb.2018.sql
 13G sensorsdb.2019.sql
 12G sensorsdb.2020.sql
 10G sensorsdb.2021.sql
 20K sensorsdb.2022.sql

The following is a table that summarizes the file size and the time required to create it:


year | SQL size | time to dump
2018 | 3.5 GB   | 7 minutes
2019 | 13 GB    | 20 minutes
2020 | 12 GB    | 20 minutes
2021 | 10 GB    | 17 minutes


Compress them!

Use xz with the default settings, which according to my installation is compression level 6:



% for y in $(echo 2018 2019 2020 2021 2022); do
    echo "Compress year $y"
    time xz sensorsdb.$y.sql
done

Compress year 2018
xz sensorsdb.$y.sql  2911,75s user 12,62s system 98% cpu 49:22,22 total
Compress year 2019
xz sensorsdb.$y.sql  7411,57s user 41,22s system 98% cpu 2:06:24,38 total
Compress year 2020
xz sensorsdb.$y.sql  6599,22s user 19,08s system 98% cpu 1:52:07,38 total
Compress year 2021
xz sensorsdb.$y.sql  5487,37s user 15,25s system 98% cpu 1:33:08,32 total
Compress year 2022
xz sensorsdb.$y.sql  0,01s user 0,01s system 36% cpu 0,069 total



It requires from one to two hours to compress every single file, as summarized in the following table:



File size | Time       | Compressed size | Compression ratio
3.5 GB    | 50 minutes | 227 MB          | 92 %
13 GB     | 2 hours    | 766 MB          | 94 %
12 GB     | 2 hours    | 658 MB          | 94 %
10 GB     | 1.5 hours  | 566 MB          | 94 %



Therefore, xz is a great tool to compress dump data, especially when that data is textual and mostly numeric. Unluckily, xz turns out to be a little slow when used with the default compression level.
How long does it take to decompress the data? Around 4 minutes per file, which is much faster than the compression.


Just as a comparison, compressing with -2 instead of -6 requires around one quarter of the time while giving up only about a third of the compression; e.g., the 13 GB file required 35 minutes instead of 120 minutes, and 1.1 GB of disk space instead of 0.77 GB. Let's see the results using -2 as the compression level:



File size | Time       | Compressed size | Compression ratio
3.5 GB    | 10 minutes | 338 MB          | 90 %
13 GB     | 35 minutes | 1.1 GB          | 91 %
12 GB     | 37 minutes | 918 MB          | 92 %
10 GB     | 30 minutes | 786 MB          | 92 %



As you can see, using compression level -2 can greatly improve the speed of compression with a minimal extra disk space requirement.
What about the directory dump format? The same backup with pg_dump -Fd, which defaults to creating compressed objects, required 4.7 GB of disk space. The xz version requires from 3.1 GB (compression -2) down to 2.2 GB (compression -6).

Conclusions

xz can help you save a lot of disk storage for textual (SQL) backups, but the default compression level can require a huge amount of time, especially on not-so-powerful machines. However, a lower compression level can make pg_dump plus xz almost as fast as pg_dump -Fd while still saving some extra space.

Luca Ferrari: kill that backend!


How to kill a backend process, the right way!

kill that backend!

Sometimes it happens: you need, as a DBA, to be harsh and terminate a backend, that is a user connection.
There are two main ways to do that:

  • use the operating system kill(1) command to, well, kill such process;
  • use PostgreSQL administrative functions like pg_terminate_backend() or the more polite pg_cancel_backend().

PostgreSQL pg_cancel_backend() and pg_terminate_backend()

What is the difference between the two functions?
Quite easy to understand: pg_cancel_backend() sends a SIGINT to the backend process, that is, it asks it politely to stop what it is doing. It is the equivalent of a standard kill -INT against the process.
But what does asking politely mean? It means cancelling the current query: the user session is not terminated, only the current user interaction. That is why it is mapped to SIGINT, the equivalent of CTRL-c (interrupt from the keyboard).
On the other hand, pg_terminate_backend() sends a SIGTERM to the process, which is equivalent to kill -TERM and brutally forces the process to exit.
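A typical workflow looks like this (the PID 12345 below is hypothetical; use the pid values returned by pg_stat_activity):

--find candidate backends and what they are running
SELECT pid, usename, state, query_start, query
FROM pg_stat_activity
WHERE state = 'active' AND pid <> pg_backend_pid();

--ask politely: cancel only the current query (SIGINT)
SELECT pg_cancel_backend(12345);

--be harsh: terminate the whole session (SIGTERM)
SELECT pg_terminate_backend(12345);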

Now, Kill it!

Which method should you use?
If you are absolutely sure about what you are doing, you can use whatever method you want!
But sometimes the caffeine level in your body is too low to do it right, and then you should use the PostgreSQL way! There are at least two good reasons to use the PostgreSQL administrative functions:

  • you don’t need access to the server, i.e., you don’t need an operating system shell;
  • you will not accidentally kill another process.


The first reason is really simple to understand, and improves security about the machine hosting PostgreSQL, at least in my opinion.
The second reason is a little less obvious, and relies on the fact that pg_cancel_backend() and pg_terminate_backend() act only against processes within the PostgreSQL space, that is, only processes spawned by the postmaster.
Let’s see this in action: imagine we select the wrong process to kill, like 174601, which is running Emacs on the server.



% ssh luca@miguel 'ps -aux | grep emacs'
luca      174601  1.6  4.6 320068 46584 pts/0    S+   08:40   0:04 emacs


% psql -h miguel -U postgres -c"SELECT pg_cancel_backend( 174601 );" testdb
WARNING:  PID 174601 is not a PostgreSQL server process
 pg_cancel_backend 
-------------------
 f
(1 row)



% psql -h miguel -U postgres -c"SELECT pg_terminate_backend( 174601 );" testdb
WARNING:  PID 174601 is not a PostgreSQL server process
 pg_terminate_backend 
----------------------
 f
(1 row)



As you can see, there is no way to misbehave against a non PostgreSQL process! The logs provide, of course, the very same warning message:



WARNING:  PID 174601 is not a PostgreSQL server process



Now, imagine what would have happened if the administrator had run something like:



% ssh luca@miguel 'sudo kill 174601'



The process, in this case Emacs, would have been killed.

Conclusions

While you can always use the well-known Unix tools to interact with PostgreSQL processes, it is strongly suggested to use the PostgreSQL tools. This improves safety checks and requires less effort in keeping track of what is happening on the cluster.


