Golang Realtime GC in Theory and Practice


How the Golang concurrent GC achieves low latencies in real-time systems: a visualization of the algorithm and an empirical comparison with other languages.

Introduction

Each day, Pusher sends billions of messages in real-time: source to destination in less than 100ms. How do we achieve this? A key factor is Go’s low-latency garbage collector.

Garbage collectors are a bane of real-time systems because they pause the program. So when designing our new message bus, we chose the language carefully. Go emphasizes low latency, but we were wary: does Go really achieve this? If so, how?

In this blog post, we’ll look at Go’s garbage collector. We’ll see how it works (the tricolor algorithm), why it works (achieving such short GC pauses), and most importantly, whether it works (benchmarking these GC pauses, and comparing them with other languages).

From Haskell to Go

The system we have been building is a pub/sub message bus with an in-memory store of published messages. This version in Go is a rewrite of our first implementation in Haskell. We stopped work on our Haskell version in May, after discovering fundamental latency problems with GHC’s garbage collector.

We published a detailed post-mortem of the Haskell implementation. The fundamental problem was that GHC’s pause times were proportional to the size of the working set (that is, the number of objects in memory). In our case, we have many objects in memory, which led to pause times of hundreds of milliseconds. This is a problem with any GC that blocks the program while it completes a collection.

Enter Go. Unlike GHC’s stop-the-world collector, Go’s collector runs concurrently with the program, which makes it possible to avoid these longer pauses. We were encouraged by Go’s focus on low latency and found it promising when we read about the latency improvements with each new version.

How does a concurrent garbage collector work?

How does Go’s GC achieve this concurrency? At its core is the tricolor mark-and-sweep algorithm. Below is an animation which shows how the algorithm works. Note how it enables the GC to run concurrently with the program: pause times become a scheduling problem. The scheduler can be configured to run GC collections only for short periods of time, interleaved with the program. This is good news for our low latency requirement!

[Animation: the tricolor mark-and-sweep algorithm running concurrently with the program.]

The animation above shows the mark phase in detail. The GC still has two stop-the-world phases: the initial stack scan for root objects, and a termination of the mark phase. Excitingly, this termination phase has recently been eliminated. We will discuss this optimization later. In practice we found the pause times of these phases to be <1ms with very large heaps.
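To make the algorithm concrete, here is a minimal, sequential sketch of tricolor marking in Go. It is for intuition only: the names are illustrative, and the real collector runs this logic concurrently with the program (and in parallel across mark workers), using write barriers to keep the invariant intact.

package main

import "fmt"

type color int

const (
        white color = iota // not yet seen; garbage if still white when marking ends
        grey               // reachable, but its children have not been scanned yet
        black              // reachable, and all of its children have been scanned
)

type object struct {
        color    color
        children []*object
}

// greySet is the work list of objects waiting to be scanned.
var greySet []*object

// shade moves a white object into the grey set.
func shade(o *object) {
        if o != nil && o.color == white {
                o.color = grey
                greySet = append(greySet, o)
        }
}

// mark shades the roots, then repeatedly takes a grey object, shades its
// children, and blackens it, until no grey objects remain. Anything still
// white is unreachable and can be swept.
func mark(roots []*object) {
        for _, r := range roots {
                shade(r)
        }
        for len(greySet) > 0 {
                o := greySet[len(greySet)-1]
                greySet = greySet[:len(greySet)-1]
                for _, c := range o.children {
                        shade(c)
                }
                o.color = black
        }
}

func main() {
        leaf := &object{}
        root := &object{children: []*object{leaf}}
        garbage := &object{}
        mark([]*object{root})
        fmt.Println(root.color == black, leaf.color == black, garbage.color == white) // true true true
}

The key invariant is that a black object never points directly at a white one; as long as that holds, anything still white when the grey set empties really is garbage.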

With a concurrent GC, there is also potential for running the GC in parallel on multiple processors.

Latency vs. throughput

If a concurrent GC can yield much lower latencies for large heap sizes, why would you ever use a stop-the-world collector? Isn’t Go’s concurrent garbage collector just better than GHC’s stop-the-world collector?

Not necessarily. Low latency has costs. The most important cost is reduced throughput. Concurrency requires extra work for synchronization and duplication, which eats into the time the program can spend doing useful work. GHC’s garbage collector is optimized for throughput, while Go’s is optimized for latency. At Pusher, we care about latency, so this is an excellent tradeoff for us.
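To make that synchronization cost concrete, here is a sketch of a Dijkstra-style insertion write barrier, similar in spirit to the bookkeeping Go’s compiler emits around pointer writes (this reuses the object and shade definitions from the tricolor sketch above; the function name is illustrative). Every pointer write the program performs while marking is in progress pays this extra cost, and that overhead is part of the throughput lost to concurrency.

// writePointer stands in for an ordinary pointer assignment,
// src.children[i] = dst. While marking is in progress, the barrier shades
// the new target so a concurrently running marker cannot miss it.
func writePointer(src *object, i int, dst *object) {
        shade(dst) // the extra work: keep the tricolor invariant intact
        src.children[i] = dst
}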

A second cost of concurrent garbage collection is unpredictable heap growth. The program can allocate arbitrary amounts of memory while the GC is running. This means the GC must be run before the heap reaches the target maximum size. But if the GC is run too soon, then more collections will be performed than necessary. This tradeoff is tricky (Austin Clements provides a great overview of this problem). At Pusher, this unpredictability has not been a problem; our programs tend to allocate memory at a predictable constant rate.
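Go exposes this tradeoff through a single pacing knob: the GOGC environment variable, also settable at runtime via debug.SetGCPercent. It controls how far the heap may grow beyond the live data of the previous collection before the next one starts. A minimal sketch:

package main

import (
        "fmt"
        "runtime"
        "runtime/debug"
)

func main() {
        // GOGC=100 (the default) starts a collection when the heap has grown
        // to roughly twice the live data left after the previous one.
        // Lowering it collects earlier and more often (less memory, less
        // throughput); raising it does the opposite.
        old := debug.SetGCPercent(50)
        fmt.Println("previous GOGC value:", old)

        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        // NextGC is the heap size at which the runtime plans to start the
        // next collection.
        fmt.Println("next GC target:", m.NextGC, "bytes")
}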

How Does It Perform in Practice?

So far, Go’s GC looks like a good fit for our latency requirements. But how does it perform in practice?

Earlier this year, when investigating the pause times in the Haskell implementation, we created a benchmark for measuring pauses. The benchmark program repeatedly pushes messages into a size-limited buffer; old messages constantly expire and become garbage. The heap size is kept large, which is important because the heap must be traversed in order to detect which objects are still referenced. This is why GC running time is proportional to the number of live objects and the pointers between them.

Here is the benchmark in Go, where the buffer is modeled as an array:

package main

import (
        "fmt"
        "time"
)

const (
        windowSize = 200000
        msgCount   = 1000000
)

type (
        message []byte
        buffer  [windowSize]message
)

var worst time.Duration

// mkMessage allocates a fresh 1kB message filled with a marker byte.
func mkMessage(n int) message {
        m := make(message, 1024)
        for i := range m {
                m[i] = byte(n)
        }
        return m
}

// pushMsg overwrites the oldest slot in the buffer with a new message,
// recording the worst time taken by any single push.
func pushMsg(b *buffer, highID int) {
        start := time.Now()
        m := mkMessage(highID)
        b[highID%windowSize] = m
        elapsed := time.Since(start)
        if elapsed > worst {
                worst = elapsed
        }
}

func main() {
        var b buffer
        for i := 0; i < msgCount; i++ {
                pushMsg(&b, i)
        }
        fmt.Println("Worst push time: ", worst)
}

Following James Fisher’s blog post, Gabriel Scherer wrote a follow-up blog post which compared the original Haskell benchmark to versions in OCaml and Racket. He created a repo containing these benchmarks, and Santeri Hiltunen added a version for Java. I decided to port the benchmark to Go to see how it would perform in comparison.

Without further ado, here are the benchmark results on my system:

Benchmark                                                                  Longest pause (ms)
OCaml 4.03.0 (map based) (manual timing)                                                 2.21
Haskell/GHC 8.0.1 (map based) (rts timing)                                              67.00
Haskell/GHC 8.0.1 (array based) (rts timing) [1]                                        58.60
Racket 6.6 experimental incremental GC (map based) (tuned) (rts timing)                144.21
Racket 6.6 experimental incremental GC (map based) (untuned) (rts timing)              124.14
Racket 6.6 (map based) (tuned) (rts timing) [2]                                        113.52
Racket 6.6 (map based) (untuned) (rts timing)                                          136.76
Go 1.7.3 (array based) (manual timing)                                                   7.01
Go 1.7.3 (map based) (manual timing)                                                    37.67
Go HEAD (map based) (manual timing)                                                      7.81
Java 1.8.0_102 (map based) (rts timing)                                                161.55
Java 1.8.0_102 G1 GC (map based) (rts timing)                                          153.89

The two surprises here are Java, which performed very poorly, and OCaml, which performed very well. The ~2ms pause times for OCaml are due to the incremental GC algorithm that OCaml uses for its old generation. (Our main reason for not choosing OCaml was its poor support for multicore parallelism.)

As you can see, Go performs well, with pause times around 7ms. This is well within our requirements.

Some caveats

Always be wary of benchmarks! Different runtimes are optimized for different use cases and different platforms. However, since we had clear latency requirements, and this benchmark represents our use-case, it shows that Go works very well for us.

map vs array based — Originally our benchmarks were based around inserting and deleting items from a map. However, Go had a bug in the GC with how it handled large maps, which obscured our results. For this reason we decided to switch the map for the mutable array above. See the merge request for this discussion. The Go map bug is fixed in Go 1.8, but not all the benchmarks have been ported over, which is why I have distinguished between the two. Despite this, there is no reason to expect GC times to be that much worse with maps (aside from bugs or poor implementations).
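For reference, here is a sketch of roughly what the map-based variant does, written as a drop-in alongside pushMsg in the benchmark above (the actual map benchmark lives in the repo; this version is illustrative):

// pushMsgMap inserts a new message keyed by its ID and deletes the message
// that falls out of the window, so windowSize messages stay live.
func pushMsgMap(b map[int]message, highID int) {
        start := time.Now()
        b[highID] = mkMessage(highID)
        if lowID := highID - windowSize; lowID >= 0 {
                delete(b, lowID)
        }
        if elapsed := time.Since(start); elapsed > worst {
                worst = elapsed
        }
}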

manual vs rts timing — As a second caveat, the benchmarks differ in how they are timed: some use a manual timer, while others read pause statistics from the runtime system. The difference exists because not every runtime reports a useful statistic: Go’s runtime, for example, reports only the duration of its stop-the-world phases, not the delays introduced by the concurrent phase. We were also concerned that turning this profiling on would adversely affect some GCs. For this reason we want to port all the benchmarks to manual timing.
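For what it’s worth, Go’s stop-the-world pauses can be read via runtime.ReadMemStats; a sketch of doing so follows. This is exactly the statistic that misses the concurrent phase, which is why the benchmark above times each push manually instead.

package main

import (
        "fmt"
        "runtime"
        "time"
)

func main() {
        // Allocate some garbage so the collector has run at least once.
        for i := 0; i < 100; i++ {
                _ = make([]byte, 1<<20)
        }
        runtime.GC()

        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        // PauseNs is a circular buffer of recent stop-the-world pause
        // durations; the most recent entry sits at (NumGC+255)%256.
        last := time.Duration(m.PauseNs[(m.NumGC+255)%256])
        fmt.Println("collections so far:   ", m.NumGC)
        fmt.Println("most recent STW pause:", last)
        fmt.Println("total STW pause time: ", time.Duration(m.PauseTotalNs))
}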

A final caveat concerns worst cases in the benchmark implementations themselves. There is a chance that a worst-case amortized insert/delete operation on a map could adversely affect a timing, which is another reason to switch to simple arrays.

Please contribute more languages to our benchmarks! This simple benchmark is very general, and an important one when choosing a language. If you want to see how $YOUR_LANGUAGE’s GC performs, then please submit a PR! :) I would be particularly interested to know why the Java pause times are so bad, as the theory suggests they should be better.

Why Are the Go Results Not Better?

So with the map bug fixed in the compiler, or when using an array, we get pause times of ~7ms. This is really quite good, but based on the benchmark results from the Go team on the presentation slide titled “1.5 Garbage Benchmark Latency”, we would expect pauses of around 1ms for our heap size of 200MB. (GC times tend to be proportional to the number of pointers rather than the number of bytes, but unfortunately they do not provide that information.) The Twitch team also describe pause times of ~1ms with Go 1.7 (although they are not clear on the number of heap objects).

I asked about this on the golang-nuts mailing list. Rhys Hiltner suggested that these pause times could be caused by a currently unfixed bug, in which idle mark workers in the GC can block the program even when there is work to do. To try and confirm this, I fired up go tool trace [3], which visualizes the runtime behavior of a program.
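If you want to capture such a trace yourself, here is a minimal sketch using the standard runtime/trace package (the output file name is arbitrary):

package main

import (
        "log"
        "os"
        "runtime/trace"
)

func main() {
        // Record an execution trace, then inspect it with:
        //   go tool trace trace.out
        f, err := os.Create("trace.out")
        if err != nil {
                log.Fatal(err)
        }
        defer f.Close()

        if err := trace.Start(f); err != nil {
                log.Fatal(err)
        }
        defer trace.Stop()

        // ... run the workload under investigation here ...
}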

As the trace showed, there was a period of 12ms where the background mark workers were running on all four processors, blocking the program. This makes me strongly suspect I am experiencing the aforementioned bug.

By this point I was happy with the pause times I was seeing for our benchmark, but I kept an eye out for any fixes to the above issue.

As mentioned previously, there was recently some buzz around the Go team’s announcement of an improvement that had led to sub-1ms GC pause times. It briefly got my hopes up, but I soon realized that this optimization removed one of the stop-the-world phases of the GC, which was already <1ms in the benchmark I was using. Our pause times, by contrast, were caused by the concurrent phase of the GC.

Nonetheless, this is a welcome improvement for the GC, and demonstrates the team’s continued focus on improving the GC’s latency. The technical description of this optimization is an interesting read in itself.

Conclusion

The key takeaway from this investigation is that GCs are either optimized for lower latency or higher throughput. They might also perform better or worse at these depending on the heap usage of your program. (Are there a lot of objects? Do they have long or short lifetimes?)

It is important to understand the underlying GC algorithm in order to decide whether it is appropriate for your use-case. It’s also important to test the GC implementation in practice. Your benchmark should exhibit the same heap usage as the program you intend to implement. This will check the effectiveness of the GC implementation in practice. As we saw, the Go implementation is not without faults, but in our case the issues were acceptable. I would love to see the same benchmark in more languages if you would like to contribute :)

Despite some issues, Go’s GC performs well compared to other GCed languages. The Go team have been improving the latency, and continue to do so. We’re happy with Go’s GC, in theory and practice.

Footnotes

[1] Not yet merged at the time of writing.
[2] The manually tuned version performed much worse for me than it did for Gabriel Scherer. I assume this is because the optimization is very system dependent.
[3] I had been using Go for three months before finding out about go tool trace, and during that time was frequently frustrated by the belief that nothing like it existed. Tools like this are indispensable when tracking down these kinds of RTS performance problems. If you’ve used ThreadScope in Haskell, go tool trace is very similar. I’m planning to write a tutorial on go tool trace in the near future.

Credits:

  • James Fisher for the animation.
  • Gabriel Scherer and Santeri Hiltunen for benchmarks.