On Linux, a "compiler" is usually a synonym for gcc, but clang has been gaining more and more adoption. Over the years, Phoronix has published several articles comparing the performance of various clang and gcc versions, suggesting that while clang improves over time, gcc still wins in most benchmarks - except maybe "compilation time", where clang is a clear winner. But none of those benchmarks is really a database-style application, so the question is how much difference you can get by switching a compiler (or a compiler version). So I ran a bunch of tests, with gcc versions 4.1-4.9, clang 3.1-3.5, and just for fun with icc 2013 and 2015. Here are the results.
I did the two usual types of tests - pgbench, representing a transactional workload (lots of small queries), and a subset of the TPC-DS benchmark, representing analytical workloads (a few queries chewing through large amounts of data).
I'll present results from a machine with an i5-2500k CPU, 8GB of RAM and an SSD drive, running Gentoo with kernel 3.12.20. I did rudimentary PostgreSQL tuning, mostly by tweaking postgresql.conf like this:
```
shared_buffers = 1GB
work_mem = 128MB
maintenance_work_mem = 256MB
checkpoint_segments = 64
effective_io_concurrency = 32
```
I do have results from another machine too, but in general they confirm the results presented here. PostgreSQL was compiled like this:
```
./configure --prefix=...
make install
```
i.e. nothing special (no custom tweaks, etc.). The rest of the system is compiled with gcc 4.7.
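Switching compilers then just means pointing configure to a different one. A minimal sketch (the installation prefixes and compiler binary names are only illustrative - autoconf picks up the compiler from the CC environment variable):

```
# build with a specific clang version
CC=clang-3.5 ./configure --prefix=/opt/pg-clang-3.5
make && make install

# and the same with a specific gcc version
CC=gcc-4.9 ./configure --prefix=/opt/pg-gcc-4.9
make && make install
```

The CC value is recorded at configure time, so the subsequent make uses the chosen compiler automatically.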
pgbench
I ran pgbench with three dataset sizes - small (~150MB), medium (~25% of RAM) and large (~200% of RAM). For each scale I ran pgbench with 4 clients (which is the number of cores on the CPU) for 15 minutes, repeated 3x, and averaged the results. All of this both in read-write and read-only mode.
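For illustration, a single run at one scale looks roughly like this (a sketch only - the database name and the -j option are my choice; the scales and the 15-minute duration come from the description above):

```
# initialize the dataset - scale 10 is the "small" (~150MB) case,
# scale 140 the "medium" (~25% of RAM) one
pgbench -i -s 10 pgbench

# read-only run: 4 clients, 15 minutes (-S = SELECT-only script)
pgbench -S -c 4 -j 4 -T 900 pgbench

# read-write run: same, but with the default TPC-B-like script
pgbench -c 4 -j 4 -T 900 pgbench
```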
The first observation is that once you start hitting the drives, the compiler makes absolutely no measurable difference. That makes the results from all the read-write tests (for all scales) uninteresting, as well as the read-only test on the large dataset - for all these tests I/O is the main bottleneck (and that's something the compiler can't really influence).
So we're left with just the read-only benchmark on small and medium datasets, where the results look like this:
compiler | tps (small scale=10) | tps (medium scale=140) |
---|---|---|
gcc 4.1.2 | 52932 | 49837 |
gcc 4.2.4 | 53071 | 50219 |
gcc 4.3.6 | 52147 | 49396 |
gcc 4.4.7 | 52597 | 49834 |
gcc 4.5.4 | 53537 | 50143 |
gcc 4.6.4 | 53238 | 49959 |
gcc 4.7.4 | 54383 | 51033 |
gcc 4.8.3 | 54494 | 51627 |
gcc 4.9.2 | 55084 | 52515 |
clang 3.1 | 55160 | 51748 |
clang 3.2 | 55848 | 52197 |
clang 3.3 | 54946 | 51906 |
clang 3.4 | 55297 | 52306 |
clang 3.5 | 55800 | 52458 |
icc 2013 | 52249 | 49197 |
icc 2015 | 52064 | 49064 |
Let's use the gcc 4.1.2 results as a baseline, and express the other results as a percentage of the baseline. So 100 means "same as gcc 4.1.2", 90 means "10% slower than gcc 4.1.2" and so on. On a chart it then looks like this (the higher the number, the better):
Not really a dramatic difference:
- gcc 4.9 and clang 3.5 are winners, with ~4-5% improvement over gcc 4.1.2
- gcc improves over time, with the exception of 4.3/4.4, where the performance dropped below 4.1
- clang is very fast right from 3.1, peaking at 3.2 (which is slightly better than 3.5)
- surprisingly, icc gives the worst results here
TPC-DS
Now, the data warehouse benchmark. I've used a small dataset (1GB), so that it fits into memory - otherwise we'd hit the I/O bottlenecks and the compilers would make no difference. First, let's load the data - the script performs these operations (sketched below the list):
- COPY data into all the tables
- create indexes
- VACUUM FULL (not really necessary)
- VACUUM FREEZE
- ANALYZE
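A rough outline of what the load script does - the database name, file paths and table list here are illustrative, not the actual script:

```
# COPY the data into all the tables (TPC-DS uses '|'-delimited files)
for t in store_sales store_returns catalog_sales; do   # ... and the other TPC-DS tables
    psql tpcds -c "\copy $t from '/data/$t.dat' with (format csv, delimiter '|')"
done

psql tpcds -f create-indexes.sql   # CREATE INDEX statements for all tables
psql tpcds -c "VACUUM FULL"        # not really necessary, as noted above
psql tpcds -c "VACUUM FREEZE"
psql tpcds -c "ANALYZE"
```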
The results (in seconds) look like this:
compiler | copy | indexes | vacuum full | vacuum freeze | analyze | total |
---|---|---|---|---|---|---|
gcc 4.1.2 | 110 | 131 | 168 | 5 | 8 | 422 |
gcc 4.2.4 | 105 | 128 | 162 | 5 | 8 | 408 |
gcc 4.3.6 | 103 | 127 | 160 | 4 | 7 | 401 |
gcc 4.4.7 | 102 | 127 | 160 | 4 | 7 | 400 |
gcc 4.5.4 | 101 | 126 | 160 | 4 | 6 | 397 |
gcc 4.6.4 | 103 | 128 | 162 | 5 | 8 | 406 |
gcc 4.7.4 | 100 | 122 | 156 | 3 | 6 | 387 |
gcc 4.8.3 | 101 | 122 | 155 | 3 | 6 | 387 |
gcc 4.9.2 | 102 | 118 | 150 | 3 | 8 | 381 |
clang 3.1 | 108 | 129 | 162 | 4 | 8 | 411 |
clang 3.2 | 104 | 125 | 160 | 4 | 6 | 399 |
clang 3.3 | 105 | 125 | 160 | 3 | 6 | 399 |
clang 3.4 | 106 | 126 | 161 | 3 | 8 | 404 |
clang 3.5 | 105 | 127 | 162 | 4 | 8 | 406 |
icc 2013 | 106 | 129 | 163 | 4 | 8 | 410 |
icc 2015 | 105 | 125 | 160 | 4 | 6 | 400 |
According to the totals, the difference between the slowest (gcc 4.1.2) and fastest (gcc 4.9.2) is ~10%. Again, gcc continuously improves, which is nice. Clang actually slightly slows down since 3.2, which is not so nice, and clang 3.5 is ~6.5% slower than gcc 4.9.2. And icc is somewhere in between, with a nice speedup between 2013 and 2015 versions.
But that was just loading the data - what about the actual queries? TPC-DS specifies 99 query templates. Some of those use features not yet available in PostgreSQL, leaving us with 61 PostgreSQL-compatible templates. Sadly, 2 of those did not complete within 30 minutes on the 1GB dataset (clearly, room for improvement), so the actual benchmark consists of 59 queries.
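For context, running the queries boils down to something like this (a sketch - the file layout and database name are mine, not the actual benchmark driver):

```
# run each query template three times, with a 30-minute cap per execution
for q in queries/*.sql; do
    for run in 1 2 3; do
        start=$(date +%s%N)
        timeout 1800 psql tpcds -f "$q" > /dev/null
        end=$(date +%s%N)
        echo "$q run $run: $(( (end - start) / 1000000 )) ms"
    done
done
```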
The chart of the total duration of three runs per query, using gcc 4.1.2 as a baseline (just like for pgbench, but this time lower numbers are better), looks like this:
Clearly, the differences are much more significant than in the pgbench results. Again, gcc continuously improves over time, with 4.9.2 being the winner here - the difference between 4.1.2 and 4.9.2 is an astonishing ~15%. That's a pretty significant improvement - good work, GCC developers!
Clang results fluctuate a lot - 3.1, 3.3 and 3.5 are quite good (not as good as gcc 4.9.2, though).
And icc is again somewhere in the middle - faster than gcc 4.1.2, but nowhere near as fast as gcc 4.9.2 or the "good" clang versions. And this time the 2015 version actually slowed down (contrary to the previous results).
Summary
If your workload is transactional (pgbench-like), the compiler does not matter that much - either you're hitting disks (and the compiler does not matter at all), or the differences are within 5% of gcc 4.1.2. And if a gain this small is significant enough for you to warrant switching compilers, you should probably consider getting slightly more powerful hardware instead (a CPU with more cores, faster RAM, better storage, ...).
Analytical workloads are a different case - gcc is a clear winner, and if you're using an ancient version (say, 4.3 or older), you can get a ~10% speedup by switching to 4.7, or ~15% by switching to 4.9. In any case, the newer the version, the better.