Last week, I briefly explained the recent improvements in count_distinct, a custom alternative to the native COUNT(DISTINCT) aggregate. I presented a basic performance comparison illustrating that count_distinct is in most cases significantly faster than the native COUNT(DISTINCT) aggregate. However, performance was not the initial goal of the changes, and the improvement in this respect is not that big (a few percent, maybe, which is negligible compared to the ~3x speedup against COUNT(DISTINCT)). The main issue was excessive memory consumption, and I promised to present the improvements. So here we go!
If I had to show you a single chart illustrating the improvements, it'd be this one. It compares the amount of memory necessary to aggregate the large tables from the previous post (100M rows). Remember, the lower the values the better, and red is the new implementation:
While the old implementation used 1.5GB - 2GB in most cases, the new implementation needs just ~600MB. I don't want to brag, but that's a pretty significant improvement (or you could say I messed up the initial implementation and have now merely fixed it).
So, why did the hash table suck so much?
Clearly, the original count_distinct implementation was quite bad at handling memory, and there are three reasons why. If you're writing extensions in C, this might give you a few hints for improvements.
Simple hash table implementation
First, I used a very simple hash table implementation, resulting in some unnecessary overhead for each value - you need the buckets, a pointer to the next item in the same bucket, etc. This could be lowered a bit by using a more elaborate hash table implementation (e.g. open addressing), but either the gains were not as significant as I expected, or the resulting implementation was more complex than I wished for, or slower.
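To illustrate the per-item overhead, an entry of a chained hash table typically looks something like this (a hypothetical sketch, not the structure the extension actually used):

```c
/* hypothetical entry of a chained hash table - for a 4B value, the
 * next-pointer alone adds 8B, not counting padding or per-allocation
 * overhead */
typedef struct hash_entry_t {
    struct hash_entry_t *next;  /* next entry in the same bucket */
    uint32               value; /* the actual 32-bit item */
} hash_entry_t;
```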
After a lot of experiments, I switched to this simple array-based structure:
```c
typedef struct element_set_t {

    uint32 item_size;   /* length of the value (depends on the actual data type) */
    uint32 nsorted;     /* number of items in the sorted part (distinct) */
    uint32 nall;        /* number of all items (unsorted part may contain duplicates) */
    uint32 nbytes;      /* number of bytes in the data array */

    /* aggregation memory context (reference, so we don't need to do lookups repeatedly) */
    MemoryContext aggctx;

    /* elements */
    char *data;         /* nsorted items first, then (nall - nsorted) unsorted items */

} element_set_t;
```
And this resulted in pretty much zero per-item overhead. Yes, the array will always be a bit larger than necessary, but on average the overhead is much lower than with any hash table implementation. If for each 32-bit item you have to keep a 64-bit pointer to the next item in the bucket, that's a pretty significant overhead. The hash table I was using was not quite this dumb (it was using batches of items, ...), but you get the idea.
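To make that concrete, here's a simplified sketch of how a value might be appended to this structure (hypothetical code, not the extension's actual function - compact_set() stands in for a helper that sorts the unsorted tail, merges it into the sorted part, drops duplicates, and grows the array when needed):

```c
/* append a value to the unsorted tail of the data array (sketch) */
static void
add_element(element_set_t *eset, char *value)
{
    /* no room for another item - compact (sort + deduplicate) and
     * possibly grow the array first */
    if ((eset->nall + 1) * eset->item_size > eset->nbytes)
        compact_set(eset);

    /* copy the value right after the existing items */
    memcpy(eset->data + (eset->nall * eset->item_size), value, eset->item_size);
    eset->nall += 1;
}
```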
Allocating tiny pieces of memory
Second, I was using palloc to allocate very small pieces of memory (just a few bytes at a time). That, however, results in significant hidden overhead, because each palloc prepends the allocated piece with a ~16B header. So calling palloc(50) actually means you get 66 bytes of memory - a 16B header, followed by the 50B you requested.
This header contains information associating the allocated piece with a MemoryContext, a concept used by PostgreSQL to simplify memory management. I'm not going to explain the details of how MemoryContext works (read the comments in palloc.h), but if you've ever had to deal with a complex C codebase, you probably know how tedious and error-prone memory management can be. Tracking every little piece of memory as it is passed from function to function is no fun - the MemoryContext concept makes this much easier, and the palloc overhead is the price for that.
But there are ways to lower the cost - the most obvious one is allocating larger pieces of memory. You still get the 16B header, but the overhead is much lower (a 16B header on a 50B allocation is ~30% overhead, but on a 500B allocation it's only ~3%). And that's exactly what the array-based implementation does - it allocates a single array as a whole, not a tiny piece of memory for each separate item.
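In the context of this extension, the difference looks roughly like this (an illustrative sketch - the variable names come from the element_set_t structure above, and the actual extension code differs):

```c
/* per-item allocation: every value pays the ~16B chunk header
 * (plus rounding), which easily dwarfs a 4B item */
char *item = palloc(eset->item_size);

/* single large allocation in the aggregate memory context: one
 * header for the whole array, so the per-item overhead is negligible */
eset->data = MemoryContextAlloc(eset->aggctx, eset->nbytes);
```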
Memory context internals
And the third reason is really about memory context internals. The truth is, memory contexts don't internally work with pieces of arbitrary size - they work with sizes that are powers of 2. So if you do palloc(10), the implementation really does palloc(16), because 16 is the nearest higher power of 2. Similarly for larger values - palloc(1100) allocates the same amount of memory as palloc(2048), and so on.
So how can you use this to your benefit? Only request sizes that are powers of two, matching the memory context internals perfectly - that minimizes the amount of allocated memory you can't actually use. And again, this is exactly what the new implementation does.
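In practice that means growing the data array by doubling it, so that (assuming the initial size is a power of two) every request matches the allocator's chunk sizes exactly. A minimal sketch, not the extension's actual code:

```c
/* double the data array - if the initial nbytes is a power of two,
 * each repalloc request is also a power of two, so none of the
 * memory handed back by the memory context is wasted on rounding */
static void
grow_set(element_set_t *eset)
{
    eset->nbytes *= 2;
    eset->data = repalloc(eset->data, eset->nbytes);
}
```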
Memory consumption comparison
So, let's see the memory consumption of the new (sorted array) implementation, compared to the old (hash table) one. I'll use the same datasets and queries as in the previous post. The consumed memory was measured by placing a call to MemoryContextStats at an appropriate place in the code, so it includes everything - the palloc overhead, the overhead of the structures, unused space in the array, and of course the overhead of the HashAggregate itself (dynahash, palloc, ...).
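For reference, the measurement itself is just a single call (the context name is whatever the aggregate happens to use - aggcontext here is illustrative):

```c
/* dumps statistics (blocks, chunks, used and free space) for the
 * given memory context and all its children to stderr */
MemoryContextStats(aggcontext);
```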
The hash-table and sorted-array columns denote the old and new implementation, respectively. I've omitted the native aggregation, because it always leads to a GroupAggregate plan, keeping only a single group in memory at a time (making the memory consumption mostly irrelevant). The cases with the largest number of groups are omitted for the same reason (see "Aggregates vs. sort" in the previous post for an explanation).
The data column tracks how much actual data needs to be kept in memory (groups x items per group x 4B - e.g. 100 groups x 1000 items x 4B ≈ 390 kB).
Small dataset (100k rows)
| groups | items per group | data | hash-table | sorted-array |
|--------|-----------------|--------|------------|--------------|
| 1 | 100000 | 390 kB | 8440 kB | 568 kB |
| 10 | 10000 | 390 kB | 4249 kB | 889 kB |
| 100 | 1000 | 390 kB | 4088 kB | 1016 kB |
| 10000 | 10 | 390 kB | 8184 kB | 4088 kB |
| 992 | 100 | 387 kB | 2040 kB | 1016 kB |
| 992 | 9 | 35 kB | 1016 kB | 248 kB |
| 101 | 640 | 258 kB | 2040 kB | 504 kB |
Medium dataset (10M rows)
| groups | items per group | data | hash-table | sorted-array |
|--------|-----------------|--------|------------|--------------|
| 1 | 10000000 | 39 MB | 260 MB | 64 MB |
| 10 | 1000000 | 39 MB | 218 MB | 44 MB |
| 100 | 100000 | 39 MB | 353 MB | 52 MB |
| 10000 | 1000 | 39 MB | 224 MB | 88 MB |
| 9992 | 1000 | 39 MB | 177 MB | 63 MB |
| 9992 | 80 | 3 MB | 24 MB | 8 MB |
| 10000 | 644 | 25 MB | 104 MB | 48 MB |
Large dataset (100M rows)
| groups | items per group | data | hash-table | sorted-array |
|--------|-----------------|--------|------------|--------------|
| 1 | 100000000 | 381 MB | 2068 MB | 512 MB |
| 10 | 10000000 | 381 MB | 2384 MB | 640 MB |
| 100 | 1000000 | 381 MB | 2116 MB | 406 MB |
| 10000 | 10000 | 381 MB | 2557 MB | 786 MB |
| 9999 | 10000 | 381 MB | 1755 MB | 589 MB |
| 9999 | 96 | 4 MB | 32 MB | 8 MB |
| 100000 | 644 | 245 MB | 1040 MB | 408 MB |
Summary
It's immediately clear that the new implementation uses only a fraction of the memory (usually 10-25%) used by the hash table implementation. So either the old implementation was really bad, or reducing the overhead really helped a lot. Most likely it's a combination of both.
The most important lesson learned for me is that sometimes we concentrate too much on optimizing the current approach (an implementation based on a hash table), while it'd actually be better to re-evaluate the assumptions and try a completely different approach. Hash tables are great for lookups, but maybe they're not the best tool when you need to remove duplicate values from a set.
It's also clear that the new implementation still uses about 2x the amount of memory required by the data itself. I believe that's pretty much the minimum achievable overhead while still keeping the performance. First, doubling the array size means ~50% overhead on average (right before doubling it's 0%, right after it's 100%, thus ~50% on average). Also, there's additional overhead outside this extension - the Hash Aggregate needs to keep its own hash table and some other structures, etc.
There are certainly ways to improve Hash Aggregate in general - for example, in the discussion below the previous post, Noah Misch asked whether it was possible to limit the memory consumption by spilling the groups to disk. It's certainly possible, but not from the extension, because the extension only sees isolated groups - it's pretty much impossible to make good decisions about when to spill to disk, what to spill, etc. This needs to be done at the Hash Aggregate level, and by coincidence that is currently being discussed on pgsql-hackers. It's quite a complex change and there's a lot of work to be done, so don't hold your breath.