Quantcast
Channel: Planet PostgreSQL
Viewing all articles
Browse latest Browse all 9649

Chris Travers: Math and SQL, part 2: Functions and First Normal Form

$
0
0
There is probably no piece of relational database theory which is so poorly understood in professional literature than first normal form.  This piece seeks to ground an understanding of 1NF in set mathematics not only so that it can be better leveraged but also so that one can be aware of when (and when not) to violate it.

I:  Defining First Normal Form


In the previous post, we talked about the definition of function and relation.  This is key to understanding first normal form.

First Normal Form requires two things, as defined by Codd:
  1. Each relation is a well-formed relation (i.e. a well-formed set of fact correlations). 
  2. No element of the relation (i.e. correlated fact) is itself a set.  Therefore sets within sets are not allowed.
By extension the second prong prevents elements of the relation from having forms which are readily reduced to sets.  For example, an unordered comma separated list of values violates 1NF.  A clearer way of putting the second prong is:

Each element of a relation must contain a single value for its domain.

A clear mathematical way of looking at First Normal Form is:


  1. Each relation is a set of tuples.
  2. No tuple contains a set.


II: Sets and Ordinality


Let's further note that one defining feature of a set is that it has no ordering.  More formally we would say that sets have no ordinality.  {1, 2, 3} = {2, 1, 3}.  For this reason, no duplicate rows may be allowed, as that means that a relation is not a well-formed set.  

Tuples (the correspondences which make up the relation) do have mandatory ordering, and allow duplication.  However the ordering is not "natural" and therefore can be specified in relational operations.  In relational algebra, the project operation allows you to generate one tuple ordering from another.

Ordinality is an extremely important concept regarding normalization and arrays in the database, as we will see with SQL arrays.

III: SQL Arrays, Ordinality and 1NF


An SQL array is actually a mathematical matrix, carrying with it all the principle properties of this.  In particular, two basic constraints exist:

  1. SQL arrays are ordinal, so [1, 2, 3] != [2, 1, 3]
  2. Multi-dimensional arrays must be well-formed matrices.  I.e. [[2, 1, 3], [3, 2, 1]] is ok but [1, 2, [3, 2]] is not.
Arrays can be used in a number of different ways.  Some of these violate first normal form.  Others do not.  For example the following does not violate 1NF (though I am deliberately avoiding PostgreSQL's built-in types for networking):

CREATE TABLE network_addrs (
    host_id int primary key,
    hostname text,
    ip_as_octets int[]
);
INSERT INTO network_addrs(1, 'myhost', ARRAY[192,168,1,124]);

Because [192,168,1,124] and [124,168,1,192] are not equivalent, first normal form is not violated.

On the other hand consider:

CREATE TABLE web_page (
      file_name text primary key,
      content text,
      tags text[]
);
INSERT INTO web_page ('home.html', '<html></html>', ARRAY['foo', 'bar']);

Because [foo, bar] is equivalent to [bar, foo], we have a 1NF violation.  Now, in PostgreSQL, there may be good reasons to do this, but there are significant costs in any RDBMS as well.

The basic tradeoff in PostgreSQL in the latter version is that you gain the possibility of optimizing select queries for tags using GIN indexes.  However, deleting a tag from the system becomes very painful.  The difference between the one which violates 1NF using tags and the one which properly uses arrays for IP addresses is that I am only going to manage the IP address as it exists at once.  I will never be modifying or deleting octets individually.  For web pages, however, I may need to reach into the array with SQL to update or delete elements.  

This is a good way of thinking about when arrays violate 1NF, and it presents an understanding of some practical problems that come into play when 1NF is violated, but doesn't quite match the theory when tuples are stored as relational elements.  More on this will be covered below.

IV:  1NF, Candidate Keys, and Functional Dependencies


Since a 1NF relation is a set, it has no duplicates.  Thus, the relation itself can be used as a domain for a function, the result of which is its members.  Thus the whole tuple can be said to be a candidate key, and every member a trivial function of it.

Interestingly PostgreSQL supports queries notating columns as functions of relations.  While we might normally run something like:

select proname, nspname 
  from pg_proc p 
  join pg_namespace n ON p.pronamespace = n.oid;

We can also write it as:

select proname, nspname 
  from pg_proc p 
  join pg_namespace n ON pronamespace(p) = oid(n);

These notations are actually directly equivalent, and while the second is sufficiently non-standard that I would probably avoid it for production code, it has the advantage of making the relationship between function and attribute quite clear.

V:  Notes about Tuples in Tuples


Now, technically storing tuples inside of tuples in relations never violates first normal form by itself (though arrays of tuples might be if they are not ordinal).  This is because tuples are always ordinal.  Nonetheless, storing tuples inside of tuples can result in some problems which look very much like 1NF violation problems.  The same holds true with JSON, and other programming structures.  These don't violate 1NF themselves, but they can cause problems of a similar sort depending on how they are used.

In general, a decent rule to live by is that if a tuple will be updated as a whole, problems don't exist but that problems begin to exist when a tuple may be updated (logically) by merely replacing elements in it.  If one thinks of tuples as immutable (more on this in an upcoming post), these problems are solved.

This is an area where functional programming as a mentality makes database design a lot easier.  One should think of tuples as immutable correlated data structures, not as internal state of the system.

VI:  When to Break 1NF


There are many times when 1NF must be broken.  

In the real world we don't often get to verify that our data conforms to rigid mathematical definitions before we load it.  For this reason, the non-duplication rule often must be ignored for data import workflows.  In these cases, one loads the data with duplicates, and filters them out with subsequent operations.  This is acceptable.

Additionally there are cases where the workloads are such that, with PostgreSQL, GIN indexes can give you major performance boosts when you need it.  This comes at a cost, in that deleting a valid element from the system can no longer perform well, but sometimes this cost is worth it for relatively unimportant and freeform data.

VII:  Perl modules and Related Concepts


In writing this I came across a number of Perl modules which are closely related to this post.

The first is Set::Relation, which implements basic relational logic in Perl.  This is worth looking at if you are trying to see how to implement set logic without an RDBMS.  A very nice feature is that by default, relations are immutable, which means you have an ability to use them in a fully Functional Programming environment.

The second is Math::Matrix, which gives you a library for matrix mathematics.  For integer and float arrays, this allows you to incorporate this into the database back-end using PL/PerlU procedures, and would be particularly useful when combined with SQL language arrays (either in the back-end or middle-ware).

VIII:  Final Thoughts


First Normal Form is often treated as a given in relational theory.  However since most books aimed at database professionals don't cover the mathematical basis for the model, it is generally poorly understood.  Understanding First Normal Form however is the beginning of understanding relational logic and relational database systems.  It therefore deserves much more exploration than the brief space it is usually given.

Understanding 1NF is particularly important when one decides to use a non-1NF database design.  These have significant tradeoffs which are not evident to beginners.  However, this is only the starting point.  In future discussions we will cover related aspects which give a more complete understanding of the basics of the system.

Next Week:  Immutability, MVCC, and Functional Programming


Viewing all articles
Browse latest Browse all 9649

Trending Articles