Go 1.6 GC improvements
Austin Clements and Rick Hudson
Go 1.5 is the first release with our new, low-pause concurrent garbage collector (GC). This is a significant stride over past releases and should open up Go to applications where it was previously unsuitable, but there is still plenty of work to do.
For 1.6, our primary goal with the garbage collector is polish: improve its overall performance, its performance robustness under extreme workloads, and the overall implementation, without major algorithmic changes. We have big plans on the horizon, but we’re going to use 1.6 as a stabilization period to make the garbage collector do what it does even better and to position it for future changes.
This is not a design document, but rather an overview of the changes we’re planning to make in 1.6 meant to help the GC developers and interested observers understand our short term goals. Many of the changes here are simple enough that they won’t even need a design document. Others will. Most of these are detailed in GitHub issues, but this document is meant to convey the higher level structure of these changes.
The overall coordination scheme of the 1.5 collector largely evolved out of the 1.4 collector. In some places, we’ve stretched the original design beyond its requirements, and the cracks are showing.
Replace GC coordinator with decentralized state machine [hard, P1]
The garbage collector is currently coordinated from a single goroutine, even though much of the work is done by worker goroutines. This coordinator leans heavily on the goroutine scheduler, which leads to complexity and, under extreme workloads, can cause nontrivial delays in phase transitions. One way to fix this is to replace the centralized coordinator with a decentralized state machine. (#11970)
Closely tied to “Redesign mark completion barrier.”
Redesign mark completion barrier [hard, P1]
This probably needs to be co-designed with the decentralized state machine. Our mark completion barrier evolved from the STW mark completion barrier and has become a difficult-to-maintain mess with poor performance properties. It needs to be rethought. (#12041)
Closely tied to “Replace GC coordinator with decentralized state machine.”
Embrace write barrier during scan [medium, P2]
Partway through the 1.5 freeze, we discovered that we had to enable write barriers during the scan phase. There is still a lot of code designed to deal with write barriers being enabled much later, and we should simplify or eliminate it. We should also take advantage of this to begin marking earlier. (#11971)
Credit system improvements
There are two credit-based proportional work systems in the Go 1.5 garbage collector: one to ensure that sweeping is completed between when a GC cycle finishes and when the heap size reaches the heap trigger, and one to ensure that scanning is completed between when the heap size reaches the heap trigger and when it reaches the heap goal.
Always operate “in the black” [easy, P1]
Originally both systems were designed to operate “in the red”: the mutator allocated first and then, if its debt exceeded some bound, paid it off. We learned toward the end of the 1.5 cycle that this is a bad idea: the work system can miss its target and, in the worst case, fail to converge on its own. Both work systems have fallbacks to ensure eventual convergence, but it’s much better to operate “in the black.” We scrambled to fix this for the proportional sweep system for 1.5 (fc9ca85). We should do this for the proportional scan system, too. (#11967)
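As a concrete sketch (not the runtime’s actual code, and with illustrative names), here is a credit-based proportional scan system operating “in the black”: the mutator is charged for an allocation and then pays off any resulting scan debt before the allocation proceeds, so scanning can never fall behind the pace needed to finish by the heap goal.

```go
package main

import "fmt"

// pacer tracks the proportional scan obligation between the heap
// trigger (where scanning starts) and the heap goal (where it must
// finish). All fields are in bytes.
type pacer struct {
	trigger, goal int64 // heap sizes where scanning starts / must finish
	totalScanWork int64 // estimated scan work for the cycle
	heapLive      int64 // current heap size
	scanWorkDone  int64 // scan work completed so far
}

// debt reports how much scan work should already have been done at
// the current heap size, minus what has actually been done.
func (p *pacer) debt() int64 {
	if p.heapLive <= p.trigger {
		return 0
	}
	frac := float64(p.heapLive-p.trigger) / float64(p.goal-p.trigger)
	owed := int64(frac*float64(p.totalScanWork)) - p.scanWorkDone
	if owed < 0 {
		return 0
	}
	return owed
}

// allocate charges size bytes and performs the assist work *first*
// ("in the black"). An "in the red" policy would return to the
// mutator immediately and repay the debt later, allowing overshoot.
func (p *pacer) allocate(size int64, scan func(bytes int64)) {
	p.heapLive += size
	if d := p.debt(); d > 0 {
		scan(d)
		p.scanWorkDone += d
	}
}

func main() {
	p := &pacer{trigger: 4 << 20, goal: 8 << 20, totalScanWork: 2 << 20}
	p.heapLive = p.trigger
	p.allocate(2<<20, func(n int64) {}) // allocate halfway to the goal
	fmt.Println(p.scanWorkDone == 1<<20, p.debt() == 0) // true true
}
```

After allocating halfway from the trigger to the goal, exactly half the estimated scan work has been forced, so the system tracks its target by construction.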
Make proportional sweep less aggressive [easy, P2]
In the process of fixing proportional sweep to operate in the black, we also made it somewhat over-aggressive, so the sweep may finish earlier than necessary. We should fix this accounting. (#12040)
May be obviated by “Eliminate sweeper,” below, but also easy to fix.
Enlarge work buffers [trivial, P2]
We made work buffers ridiculously small to mitigate the work-hiding problem that mark 2 now solves properly. We should make them big again to reduce overhead.
Revisit prefetching [medium, P2]
We explored using prefetching in the 1.5 collector, but never got a definitive speed-up. We should take another look at the memory latency profiles and revisit this.
Consider using dense mark bits [hard, P2]
Mark bits are currently interleaved with the heap bitmap and addressed by the first word of the object. Consider separating them from the heap bitmap and making them dense. This would have advantages for the sweeper and for writing out bitmaps during allocation (because there would no longer be concurrent access to the bitmap). May be necessary for other improvements. (Proposal forthcoming)
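A hypothetical sketch of the dense layout (all names illustrative): each span keeps its own packed bitmap with one bit per object, indexed by the object’s position in the span, rather than interleaving mark bits with the heap bitmap and addressing them by the object’s first word.

```go
package main

import "fmt"

// span models a run of equally sized objects with a dense,
// per-span mark bitmap: bit i corresponds to the i-th object.
type span struct {
	base     uintptr // address of the first object in the span
	elemSize uintptr // size of each object in bytes
	markBits []uint8 // one bit per object, densely packed
}

func newSpan(base, elemSize uintptr, nelems int) *span {
	return &span{base, elemSize, make([]uint8, (nelems+7)/8)}
}

// objIndex maps an object address to its dense bit index.
func (s *span) objIndex(p uintptr) uintptr {
	return (p - s.base) / s.elemSize
}

func (s *span) mark(p uintptr) {
	i := s.objIndex(p)
	s.markBits[i/8] |= 1 << (i % 8)
}

func (s *span) isMarked(p uintptr) bool {
	i := s.objIndex(p)
	return s.markBits[i/8]&(1<<(i%8)) != 0
}

func main() {
	s := newSpan(0x10000, 32, 64) // 64 objects of 32 bytes
	s.mark(0x10000 + 5*32)        // mark the 6th object
	fmt.Println(s.isMarked(0x10000+5*32), s.isMarked(0x10000)) // true false
}
```

Because the mark bits live in their own array, the sweeper (or an allocator using them directly) can walk them linearly, and allocation-time bitmap writes no longer contend with concurrent mark-bit access.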
Mark termination performance
The mark termination phase is by far the main contributor to pause time in the Go 1.5 garbage collector. While for many applications our pause time is already under 10ms, there are several things we can do to further reduce this and to bring more extreme applications under that target.
Root marking performance
Make root marking tasks finer-grained [easy, P2]
Root marking is divided into several tasks that are run in parallel and load balanced. However, some of these tasks are coarse-grained and may be large, which leads to stragglers and extends mark termination time as well as the duration of concurrent scan. We should subdivide tasks for data, BSS, and finalizer scanning. (#10345)
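The subdivision itself is simple; a minimal sketch (the 256 KiB shard size and names are illustrative, not the runtime’s): split each large segment into fixed-size shards that can be handed out to workers independently, so no single task becomes a straggler.

```go
package main

import "fmt"

// rootBlockBytes is an illustrative shard size for splitting large
// root regions (data, BSS) into independently scannable tasks.
const rootBlockBytes = 256 << 10

// rootShards returns how many scan tasks cover a segment.
func rootShards(segmentBytes int) int {
	return (segmentBytes + rootBlockBytes - 1) / rootBlockBytes
}

// shardRange returns the byte range [lo, hi) covered by shard i,
// clamping the final shard to the segment's end.
func shardRange(i, segmentBytes int) (lo, hi int) {
	lo = i * rootBlockBytes
	hi = lo + rootBlockBytes
	if hi > segmentBytes {
		hi = segmentBytes
	}
	return
}

func main() {
	seg := 1<<20 + 1234 // a ~1 MiB data segment
	n := rootShards(seg)
	fmt.Println(n) // 5
	lo, hi := shardRange(n-1, seg)
	fmt.Println(lo, hi) // the last shard is the short remainder
}
```

Each shard is a separate load-balanced task, so the longest root-marking task is bounded by the shard size rather than by the largest segment.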
Move finalizer scanning from mark termination to concurrent scan [easy, P1]
Finalizers are currently scanned during mark termination. Simply finding these finalizers takes ~1ms per GB of heap (even if there are no finalizers) and represents about half of mark termination time on large heaps. If there are finalizers, it can take significantly longer. We should move this entire process from mark termination to concurrent scan. (#11485)
Remove mSpanInUse count loop [easy, P1]
This represents the other half of mark termination time on large heaps. We should track this statistic live and eliminate this loop. (#11484)
Rescan stacks before mark termination [medium, P2]
Mark termination does not have to rescan stacks that haven’t executed since the last scan and it doesn’t have to rescan above the lowest stack barrier. We could rescan stacks toward the end of the concurrent mark phase to clean dirty stacks and move stack barriers down. (#12061)
Free stacks during concurrent sweep [medium, P3]
Currently we free stacks during mark termination. We should move this to the concurrent sweep phase. (#11465)
Sweep and scavenge
Some programs spend a significant amount of time in the sweeper, but relatively little attention has been paid to its performance in the 1.5 cycle. With a little more co-design of the sweeper and the garbage collector, this seems like a promising source of performance improvements.
Free whole spans eagerly [medium, P2]
Large object allocation currently requires a potentially expensive sweeping operation to free up large regions of memory. It may be possible to do this much more efficiently by immediately freeing unused spans at the end of GC with a little help from the mark phase. (#11979)
Fix scavenger on systems with large physical pages [hard?, P2]
The scavenger is currently disabled on systems with physical pages larger than the runtime page size. We should fix that. (#9993)
Eliminate sweeper [hard, P2]
At the end of a GC cycle, the mark bits of the heap indicate exactly which objects are free. The sweeper’s primary job is to transform this one representation of the free object set into a different representation of the free set. This suggests that we should eliminate the transformation step and use the mark bits directly for object allocation. This requires multiple sets of mark bits, but could result in significantly faster allocation, since it eliminates a significant part of the cost of acquiring a span and replaces a costly, poorly predictable linked-list walk with a linear scan. (Proposal forthcoming)
May depend on “Free whole spans eagerly” and/or “Consider using dense mark bits” depending on the design.
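A toy sketch of the allocation side of this idea (names and representation are illustrative): a clear bit means the slot is free, so the allocator finds the next zero bit and sets it, with no intermediate free list ever built. A real implementation would scan word-at-a-time (e.g., with math/bits.TrailingZeros64) rather than bit-by-bit.

```go
package main

import "fmt"

// spanAlloc allocates object slots directly from a snapshot of the
// mark bits taken at the end of a GC cycle: bit i set means object i
// is live (or has since been allocated); clear means free.
type spanAlloc struct {
	allocBits []uint64
	freeIndex int // next slot to search from
	nelems    int
}

// nextFree returns the index of the next free slot and marks it
// allocated, or -1 if the span is full. This linear scan replaces
// the sweeper's free-list construction and the linked-list walk.
func (s *spanAlloc) nextFree() int {
	for i := s.freeIndex; i < s.nelems; i++ {
		if s.allocBits[i/64]&(1<<(uint(i)%64)) == 0 {
			s.allocBits[i/64] |= 1 << (uint(i) % 64)
			s.freeIndex = i + 1
			return i
		}
	}
	return -1
}

func main() {
	s := &spanAlloc{allocBits: make([]uint64, 1), nelems: 8}
	s.allocBits[0] = 0b00100111 // objects 0, 1, 2, 5 survived marking
	fmt.Println(s.nextFree(), s.nextFree(), s.nextFree()) // 3 4 6
}
```

Keeping one bitmap per cycle ("multiple sets of mark bits") lets the next cycle mark into a fresh bitmap while the allocator consumes the previous one.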
Recycle large stack spans [medium, P3]
This isn’t technically part of the sweeper, but it’s related. Currently, if a large stack span is freed during a GC cycle, that memory can’t be reused until after the GC cycle. Since we don’t manage stack memory the way we do heap memory, in theory this could lead to unnecessary RSS growth. Small stacks don’t have this problem because they can be reused as stack memory (it’s not safe to reuse them as heap memory). We should reuse large stack memory for stacks as well. (#11466)
Miscellaneous clean up
Clean up GC state resetting [easy?, P2]
We reset different bits of GC state all over the place and it’s a common source of bugs. Some of this just needs clean up. Some of it may need more thinking. (#11427)
Clean up sysmon GC triggering [easy, P3]
One less silly system goroutine. (#11637)
More tests / benchmarks [medium, P1]
Create benchmarks focused on stressing particular aspects of GC performance, such as large stacks (#10898), narrow heaps (#11694), many goroutines (#11911), many finalizers (e.g., lots of network connections), and rapid allocation (#11677, #11911). Monitor mark rate, mark termination time, heap overshoot, and utilization to detect performance problems.
Review use of finalization throughout library code [medium, P2]
There are places in the library code where finalizers are used to handle error conditions that could probably be dealt with more efficiently at a higher level. For example, if every TCP connection has a finalizer and there are millions of connections, GC overhead starts to be dominated by finalizer handling.
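The pattern in question looks like the following (conn and its fields are stand-ins, not the net package’s actual types): a finalizer set with runtime.SetFinalizer acts as a backstop for a forgotten Close. Every live object carrying a finalizer adds per-cycle GC work, which is why millions of such connections can dominate collection overhead, and why closing promptly and clearing the finalizer is cheaper.

```go
package main

import (
	"fmt"
	"runtime"
)

// conn is a stand-in for a resource-holding type such as a TCP
// connection wrapper.
type conn struct {
	fd     int
	closed bool
}

func newConn(fd int) *conn {
	c := &conn{fd: fd}
	// Backstop: release the fd if the user forgets to Close. The GC
	// must track this finalizer for every live conn, and scanning and
	// running finalizers is the cost this section is concerned with.
	runtime.SetFinalizer(c, (*conn).Close)
	return c
}

func (c *conn) Close() {
	if !c.closed {
		c.closed = true
		// ... release c.fd here ...
		// An explicit Close removes the finalizer, and with it the
		// per-object GC work.
		runtime.SetFinalizer(c, nil)
	}
}

func main() {
	c := newConn(3)
	c.Close()
	fmt.Println(c.closed) // true
}
```

Handling the error condition at a higher level (e.g., closing connections deterministically in the code that owns them) avoids ever paying the finalizer cost.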
These are sorted by issue number. Crossed out issues are integrated into the text above.
runtime: consider rescanning stacks before mark termination #12061
runtime: redesign GC synchronization #12041
runtime: proportional sweep is over-aggressive #12040
runtime: whole span reclaimer is expensive #11979
runtime: embrace write barriers during scan #11971
runtime: replace GC coordinator with state machine #11970
runtime: perform GC assist before allocating [easy] #11967
runtime: simplify or eliminate sysmon-triggered GC #11637
runtime: fix loop over spans in markroot #11485
runtime: remove mSpanInUse count loop from STW #11484
runtime: recycle large stack spans #11466
runtime: move delayed stack free after start-the-world #11465
runtime: merge gcResetGState and gcResetMarkState #11427
runtime: partition large memory blocks during scanning #10345
runtime: scavengelist can corrupt heap memory on 64k page size systems (ppc64) #9993