Last week, I briefly explained the recent improvements in count_distinct, a custom alternative to the native COUNT(DISTINCT) aggregate. I presented a basic performance comparison illustrating that count_distinct is in most cases significantly faster than the native COUNT(DISTINCT) aggregate. However, performance was not the initial goal of the changes, and the improvement in this respect is not that big (a few percent, maybe, which is negligible compared to the ~3x speedup against COUNT(DISTINCT)). The main issue was excessive memory consumption, and I promised to present the improvements. So here we go!
If I had to show you a single chart illustrating the improvements, it'd be this one. It compares the amount of memory necessary to aggregate the large tables from the previous post (100M rows). Remember, the lower the values the better, and red is the new implementation:
While the old implementation used 1.5GB - 2GB in most cases, the new implementation needs just ~600MB. I don't want to brag, but that's a pretty significant improvement (or you could say I messed up the initial implementation and have now merely fixed it).
So, why did the hash table suck so much?
Clearly, the original count_distinct implementation was quite bad at handling memory, and there are three reasons why. If you're writing extensions in C, this might give you a few hints for improvements.
Simple hash table implementation
First, I used a very simple hash table implementation, resulting in some unnecessary overhead for each value - you need the buckets, a pointer to the next item in the same bucket, etc. This could be lowered a bit by using a more elaborate hash table implementation (e.g. open addressing), but either the gains were not as significant as I expected, or the resulting implementation was more complex than I wished for, or slower.
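To illustrate the per-item overhead, an entry of a chained hash table typically looks something like this (a hypothetical sketch, not the structure the extension actually used):

```c
/* hypothetical entry of a chained hash table - for a 4B value, the
 * next-pointer alone adds 8B, not counting padding or per-allocation
 * overhead */
typedef struct hash_entry_t {
    struct hash_entry_t *next;  /* next entry in the same bucket */
    uint32               value; /* the actual 32-bit item */
} hash_entry_t;
```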
After a lot of experiments, I switched to this simple array-based structure:
```c
typedef struct element_set_t {

    uint32 item_size;   /* length of the value (depends on the actual data type) */
    uint32 nsorted;     /* number of items in the sorted part (distinct) */
    uint32 nall;        /* number of all items (unsorted part may contain duplicates) */
    uint32 nbytes;      /* number of bytes in the data array */

    /* aggregation memory context (reference, so we don't need to do lookups repeatedly) */
    MemoryContext aggctx;

    /* elements */
    char *data;         /* nsorted items first, then (nall - nsorted) unsorted items */

} element_set_t;
```
And this resulted in pretty much zero per-item overhead. Yes, the array will always be a bit larger than necessary, but on average the overhead is much lower than with any hash table implementation. If for each 32-bit item you have to keep a 64-bit pointer to the next item in the bucket, that's a pretty significant overhead. The hash table I was using was not quite this dumb (it was using batches of items, ...), but you get the idea.
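To make that concrete, here's a simplified sketch of how a value might be appended to this structure (hypothetical code, not the extension's actual function - compact_set() stands in for a helper that sorts the unsorted tail, merges it into the sorted part, drops duplicates, and grows the array when needed):

```c
/* append a value to the unsorted tail of the data array (sketch) */
static void
add_element(element_set_t *eset, char *value)
{
    /* no room for another item - compact (sort + deduplicate) and
     * possibly grow the array first */
    if ((eset->nall + 1) * eset->item_size > eset->nbytes)
        compact_set(eset);

    /* copy the value right after the existing items */
    memcpy(eset->data + (eset->nall * eset->item_size), value, eset->item_size);
    eset->nall += 1;
}
```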
Allocating tiny pieces of memory
Second, I was using palloc to allocate very small pieces of memory (just a few bytes at a time). That, however, results in significant hidden overhead, because each palloc prepends the allocated piece with a ~16B header. So calling palloc(50) actually means you get 66 bytes of memory - a 16B header, followed by the 50B you requested.
This header contains information associating the allocated piece with a MemoryContext, a concept used by PostgreSQL to simplify memory management. I'm not going to explain the details of how MemoryContext works (read the comments in palloc.h), but if you've ever had to deal with a complex C codebase, you probably know how tedious and error-prone memory management can be. Tracking every little piece of memory as it is passed from function to function is no fun - the MemoryContext concept makes this much easier, and the palloc overhead is the price for that.
But there are ways to lower the cost - the most obvious one is allocating larger pieces of memory. You still get the 16B header, but the overhead is much lower (a 16B header on a 50B allocation is ~30% overhead, but on a 500B allocation it's only ~3%). And that's exactly what the array-based implementation does - it allocates a single array as a whole, not a tiny piece of memory for each separate item.
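In the context of this extension, the difference looks roughly like this (an illustrative sketch - the variable names come from the element_set_t structure above, and the actual extension code differs):

```c
/* per-item allocation: every value pays the ~16B chunk header
 * (plus rounding), which easily dwarfs a 4B item */
char *item = palloc(eset->item_size);

/* single large allocation in the aggregate memory context: one
 * header for the whole array, so the per-item overhead is negligible */
eset->data = MemoryContextAlloc(eset->aggctx, eset->nbytes);
```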
Memory context internals
And the third reason is really about memory context internals. The truth is, memory contexts don't internally work with pieces of arbitrary size - they work with sizes that are powers of 2. So if you do palloc(10), the implementation really does palloc(16), because 16 is the nearest higher power of 2. Similarly for larger values - palloc(1100) allocates the same amount of memory as palloc(2048), and so on.
So how can you use this to your benefit? Only request sizes that are powers of two, matching the memory context internals perfectly - that minimizes the amount of allocated memory you can't actually use. And again, this is exactly what the new implementation does.
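In practice that means growing the data array by doubling it, so that (assuming the initial size is a power of two) every request matches the allocator's chunk sizes exactly. A minimal sketch, not the extension's actual code:

```c
/* double the data array - if the initial nbytes is a power of two,
 * each repalloc request is also a power of two, so none of the
 * memory handed back by the memory context is wasted on rounding */
static void
grow_set(element_set_t *eset)
{
    eset->nbytes *= 2;
    eset->data = repalloc(eset->data, eset->nbytes);
}
```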
Memory consumption comparison
So, let's see the memory consumption of the new (sorted array) implementation, compared to the old (hash table) one. I'll use the same datasets and queries as in the previous post. The consumed memory was measured by placing a call to MemoryContextStats at an appropriate place in the code, so it includes everything - the palloc overhead, the overhead of the structures, unused space in the array, and of course the overhead of the HashAggregate itself (dynahash, palloc, ...).
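For reference, the measurement itself is just a single call (the context name is whatever the aggregate happens to use - aggcontext here is illustrative):

```c
/* dumps statistics (blocks, chunks, used and free space) for the
 * given memory context and all its children to stderr */
MemoryContextStats(aggcontext);
```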
The hash-table and sorted-array columns denote the old and new implementation, respectively. I've omitted the native aggregation, because it always leads to a GroupAggregate plan, keeping only a single group in memory at a time (making the memory consumption mostly irrelevant). The cases with the largest number of groups are omitted for the same reason (see "Aggregates vs. sort" in the previous post for an explanation).
The data column tracks how much actual data needs to be kept in memory (groups x items per group x 4B - e.g. 100 groups x 1000 items x 4B ≈ 390 kB).
Small dataset (100k rows)
| groups | items per group | data | hash-table | sorted-array |
|--------|-----------------|--------|------------|--------------|
| 1 | 100000 | 390 kB | 8440 kB | 568 kB |
| 10 | 10000 | 390 kB | 4249 kB | 889 kB |
| 100 | 1000 | 390 kB | 4088 kB | 1016 kB |
| 10000 | 10 | 390 kB | 8184 kB | 4088 kB |
| 992 | 100 | 387 kB | 2040 kB | 1016 kB |
| 992 | 9 | 35 kB | 1016 kB | 248 kB |
| 101 | 640 | 258 kB | 2040 kB | 504 kB |
Medium dataset (10M rows)
| groups | items per group | data | hash-table | sorted-array |
|--------|-----------------|--------|------------|--------------|
| 1 | 10000000 | 39 MB | 260 MB | 64 MB |
| 10 | 1000000 | 39 MB | 218 MB | 44 MB |
| 100 | 100000 | 39 MB | 353 MB | 52 MB |
| 10000 | 1000 | 39 MB | 224 MB | 88 MB |
| 9992 | 1000 | 39 MB | 177 MB | 63 MB |
| 9992 | 80 | 3 MB | 24 MB | 8 MB |
| 10000 | 644 | 25 MB | 104 MB | 48 MB |
Large dataset (100M rows)
| groups | items per group | data | hash-table | sorted-array |
|--------|-----------------|--------|------------|--------------|
| 1 | 100000000 | 381 MB | 2068 MB | 512 MB |
| 10 | 10000000 | 381 MB | 2384 MB | 640 MB |
| 100 | 1000000 | 381 MB | 2116 MB | 406 MB |
| 10000 | 10000 | 381 MB | 2557 MB | 786 MB |
| 9999 | 10000 | 381 MB | 1755 MB | 589 MB |
| 9999 | 96 | 4 MB | 32 MB | 8 MB |
| 100000 | 644 | 245 MB | 1040 MB | 408 MB |
Summary
It's immediately clear that the new implementation uses only a fraction of the memory (usually 10-25%) used by the hash table implementation. So either the old implementation was really bad, or reducing the overhead really helped a lot. Most likely it's a combination of both.
The most important lesson learned for me is that sometimes we concentrate too much on optimizing the current approach (an implementation based on a hash table), while it'd actually be better to re-evaluate the assumptions and try a completely different approach. Hash tables are great for lookups, but maybe they're not the best tool when you need to remove duplicate values from a set.
It's also clear that the new implementation still uses about 2x the amount of memory required by the data itself. I believe that's pretty much the minimum achievable overhead while still keeping the performance. First, doubling the array size means ~50% overhead on average (right before doubling it's 0%, right after it's 100%, thus ~50% on average). Also, there's additional overhead outside this extension - the Hash Aggregate needs to keep its own hash table and some other structures, etc.
There are certainly ways to improve Hash Aggregate in general - for example, in the discussion below the previous post, Noah Misch asked whether it was possible to limit the memory consumption by spilling the groups to disk. It's certainly possible, but not from the extension, because the extension only sees isolated groups - it's pretty much impossible to make good decisions about when to spill to disk, what to spill, etc. This needs to be done at the Hash Aggregate level, and by coincidence that is currently being discussed on pgsql-hackers. It's quite a complex change and there's a lot of work to be done, so don't hold your breath.