Recently, I was helping out a company that handles financial data, so the quality of their data was vital to their business. They dutifully used foreign keys on most tables to ensure that all data entered into PostgreSQL was as expected. They also had a fondness for natural keys. Their reasoning, which makes perfect sense, is that natural keys make the data more readable: I’d much rather see a row of data with a state code of NJ than the number 30, especially when debugging the application. So I went ahead and did a little benchmarking to see how big an impact the foreign keys imposed.
I constructed a simple account table that referenced four small tables containing pretty static data. The largest was the country table, with only 249 rows.
CREATE TABLE account (
    account_id       bigserial    PRIMARY KEY,
    account_type     character(3) REFERENCES account_type,
    account_sub_type character(4) REFERENCES account_sub_type,
    account_currency character(3) REFERENCES currency,
    account_country  character(2) REFERENCES country,
    account_number   varchar      NOT NULL
);
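The lookup tables themselves aren’t shown here, but they look roughly like this; the column names and the description column are my guesses, and all that matters for the benchmark is that each one has a natural-key primary key of the matching type:

CREATE TABLE account_type     (code character(3) PRIMARY KEY, description text);
CREATE TABLE account_sub_type (code character(4) PRIMARY KEY, description text);
CREATE TABLE currency         (code character(3) PRIMARY KEY, description text);
CREATE TABLE country          (code character(2) PRIMARY KEY, description text);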
The first thing I did was set a baseline by loading the data with no foreign key constraints. I created a script that inserted 100,000 rows, one single-row INSERT at a time, to simulate how the application would use the table rather than just bulk loading the data. I then reran the same load with the foreign keys in place, and I was pretty surprised to see how big the impact was.
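The original load script isn’t shown here; a minimal sketch of the idea, with made-up code values, is just a loop of single-row inserts. (Note that a DO block runs in one transaction, whereas an application typically commits each insert, so absolute timings will differ.)

DO $$
BEGIN
    -- 100,000 single-row inserts; the code values are placeholders
    FOR i IN 1..100000 LOOP
        INSERT INTO account (account_type, account_sub_type,
                             account_currency, account_country, account_number)
        VALUES ('CHK', 'CHKG', 'USD', 'US', 'ACCT-' || i);
    END LOOP;
END;
$$;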
Baseline: 19.1 seconds
With foreign keys: 42.2 seconds
I figured that the foreign keys would have some effect, but not a 121% increase in load time. That was a much larger impact than I was expecting, so I ran the same test again, this time adding the foreign keys back one at a time (a sketch of that step follows the results).
Baseline: 19.1 seconds
1 foreign key: 26.2 seconds
2 foreign keys: 29.7 seconds
3 foreign keys: 36.4 seconds
4 foreign keys: 42.2 seconds
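I won’t show the exact steps for the incremental runs, but one way to set them up is to start from the unconstrained table and, between runs, truncate it and add the next constraint back, along these lines:

TRUNCATE account;
ALTER TABLE account
    ADD CONSTRAINT account_type_fk
    FOREIGN KEY (account_type) REFERENCES account_type;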
So even a single foreign key constraint, to the account_type table with only 5 rows, cost a 37% increase in load time. Some of these tables are under pretty high load, so the impact of the foreign keys would end up being very significant, but the integrity of the data was of utmost importance. All of the referenced tables contain essentially static data, and if anything ever changes, new codes would simply be added. So I figured I would try replacing all of the codes with enums. PostgreSQL will throw an error if you try to insert a value that does not exist in the enum, so it has the same effect as a foreign key check on the codes.
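Defining the enum types is straightforward. The type names below match the redefined table; the label lists are just illustrative, not the real code sets:

CREATE TYPE acct_type     AS ENUM ('CHK', 'SAV', 'CRD');
CREATE TYPE sub_type      AS ENUM ('CHKG', 'SAVG', 'CRDT');
CREATE TYPE currency_code AS ENUM ('USD', 'EUR', 'GBP');
CREATE TYPE country_code  AS ENUM ('US', 'GB', 'DE');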
So now, rerunning the test with the table redefined to use the four enums:
CREATE TABLE account (
    account_id       bigserial     PRIMARY KEY,
    account_type     acct_type,
    account_sub_type sub_type,
    account_currency currency_code,
    account_country  country_code,
    account_number   varchar       NOT NULL
);
I saw something else I didn’t expect:
Baseline: 19.1 seconds
With enums: 17.5 seconds
That’s an 8% increase in performance. I was expecting the enum test to be close to the baseline, but I wasn’t expecting it to be faster. Thinking about it, though, it makes sense: enum values are stored as small numbers under the covers, so we’re effectively using surrogate keys, yet users still see the enum labels when they look at the data. It ended up being a no-brainer to use enums for these static tables: a gain in performance while still maintaining the integrity of the data.
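And the mostly static lookup data can still grow: adding a new code later is a one-liner (the value here is just an example), though note that older PostgreSQL releases won’t let you run ALTER TYPE ... ADD VALUE inside a transaction block.

ALTER TYPE currency_code ADD VALUE 'CHF';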