For years we have been running a service in production that has two unfortunate properties:
- it fragments the heap a lot, because it allocates blocks of many different sizes with different lifetimes
- it has a tight latency requirement, which compaction breaks when it has to work through the whole free-list mess.
The free list would easily grow to 300k to 400k blocks, and compaction would stall for a few seconds to deal with it. Our idea was to compact less and fragment less by using the first_fit policy. At first it worked fine for months. Then, every few months, it started getting stuck in minor GC. Recently it started stalling more often, and we managed to get a core dump and understand what is going on. Here is the key takeaway.
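For readers who have not touched this setting, here is a minimal sketch of how the policy can be switched at runtime through the standard `Gc` module. The helper name `set_allocation_policy` is made up for illustration, and the numeric codes are the ones documented for OCaml 4.x (0 = next-fit, 1 = first-fit, 2 = best-fit since 4.10). As far as I know, OCaml 5 runtimes ignore this field.

```ocaml
(* Hypothetical helper (not from our service): change the major-heap
   allocation policy while keeping every other GC parameter as it is. *)
let set_allocation_policy (policy : int) : unit =
  Gc.set { (Gc.get ()) with Gc.allocation_policy = policy }

(* 1 selects first-fit on OCaml 4.x runtimes. *)
let () = set_allocation_policy 1
```

The same setting can be given without recompiling via `OCAMLRUNPARAM=a=1`.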
When the manual says you have to pay a price in allocation cost when enabling first_fit, what it actually means is:
With the first_fit policy, the worst case for a minor GC is O(blocks in minor heap * blocks in free list). This can grow really big. I’m not sure yet whether it’s a bug or how it can be improved, but it can easily trigger loops in the GC code that run for more than 30 minutes.
The worst case happens when the first block in the free list is relatively big and most of the other blocks in the free list are small. The flp then has only 1 or 2 entries, and if the free list is large, say ~150k blocks, the minor GC will scan it for every minor heap block (to check whether the flp can be extended with more blocks).
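To make the quadratic behaviour concrete, here is a toy model in OCaml. It is not the runtime's freelist code, only a sketch under my own assumptions: `free_list` stands for the sizes of the blocks that come after the last flp entry, `flp_last` for the size of that last entry, and `promoted` for the number of minor heap blocks being moved to the major heap. All three names are made up for the illustration. Each promoted block triggers one walk that looks for a block bigger than `flp_last`.

```ocaml
(* Toy model only: counts how many free-list nodes get visited. *)
let scan_steps ~(free_list : int array) ~(flp_last : int) ~(promoted : int) : int =
  let steps = ref 0 in
  for _i = 1 to promoted do
    (* Walk the list until a block bigger than the last flp entry shows up. *)
    let i = ref 0 in
    let found = ref false in
    while (not !found) && !i < Array.length free_list do
      incr steps;
      if free_list.(!i) > flp_last then found := true;
      incr i
    done
  done;
  !steps

(* Bad case: every remaining block is small, so each walk covers the whole list. *)
let bad = scan_steps ~free_list:(Array.make 1_500 2) ~flp_last:1_000 ~promoted:100

(* Benign case: a bigger block sits right at the front, so each walk stops at once. *)
let good = scan_steps ~free_list:[| 5_000; 2; 2 |] ~flp_last:1_000 ~promoted:100

let () = Printf.printf "bad=%d good=%d\n" bad good
```

In the bad case the count is exactly `promoted * Array.length free_list`, so scaling the list to ~150k blocks multiplies the work per minor heap block by the same factor.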
I’m going to try another angle and compact really often instead.
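One way this could look, as a rough sketch rather than what we actually deployed: check the free list size and force a compaction once it passes a threshold. `compact_if_fragmented` and `max_free_blocks` are hypothetical names, and the threshold value is an arbitrary placeholder. Note that `Gc.stat` walks the whole heap, so the check itself has a cost and should not run on a hot path.

```ocaml
(* Hypothetical threshold: placeholder value, to be tuned per service. *)
let max_free_blocks = 50_000

(* Compact when the free list holds more than [threshold] blocks.
   Returns true if a compaction was triggered. *)
let compact_if_fragmented ?(threshold = max_free_blocks) () : bool =
  let st = Gc.stat () in
  if st.Gc.free_blocks > threshold then begin
    Gc.compact ();
    true
  end else
    false
```

An alternative that needs no code of your own is lowering `max_overhead` through `Gc.set`, which makes the runtime trigger compaction by itself more eagerly.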