Splash the cache: how caching improved our reliability
In the wake of an incident, we implemented a caching solution to help us handle spiky load better. Here’s how we did it.
A quick jargon buster:
instance - We refer to a customer’s app as an instance. We identify a customer’s app in our system via its instance_id.
The problem
One fine April morning, there was an incident that woke up three of our on-call engineers 😞. There was a problem with our webhooks pipeline, responsible for the webhooks features in Chatkit and Beams.
The pipeline was experiencing high latencies, meaning webhooks were not getting published for several minutes. The cause was a massive spike in the number of webhook jobs.
The publisher is tasked with retrieving webhook URLs, so we know where to send incoming webhook jobs. These URLs are stored in DynamoDB. The publisher looks at the webhook job’s instance_id, then queries DynamoDB for that instance’s configured webhook URL. These queries were the bottleneck.
The webhooks pipeline is written in Go. This is the specific line of code for that expensive query:
webhooksConfig, err := publisher.store.ListWebhooks(instanceID)
As a short-term solution, we doubled the database’s provisioned capacity. In the graph below, the red line represents the hard limit on database capacity, which you can see being manually increased twice (from 5 to 10 to 20 provisioned capacity units). This approach worked, but the issue could easily recur if the request load doubled again.
A better solution was to switch the DynamoDB table from fixed provisioned capacity mode to on-demand capacity mode, allowing DynamoDB to autoscale behind the scenes. On the graph above, the red capacity line drops off at the point this change took place.
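For reference, the switch can be made in the AWS console, via the CLI, or programmatically. Here is a minimal sketch using the v1 AWS SDK for Go (the table name is hypothetical, not our actual configuration):

package main

import (
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/dynamodb"
)

func main() {
    sess := session.Must(session.NewSession())
    db := dynamodb.New(sess)

    // Switch the table's billing mode from provisioned capacity to
    // on-demand ("pay per request"), letting DynamoDB scale behind the scenes.
    _, err := db.UpdateTable(&dynamodb.UpdateTableInput{
        TableName:   aws.String("webhooks-config"), // hypothetical table name
        BillingMode: aws.String(dynamodb.BillingModePayPerRequest),
    })
    if err != nil {
        log.Fatalf("switching to on-demand capacity failed: %v", err)
    }
}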
A downside of on-demand capacity mode is that you pay per request, which is roughly 7x more expensive. Furthermore, on-demand capacity doesn’t fundamentally change the shape of the incoming traffic: the graph shows we are still prone to large spikes, which would continue to impact the reliability and scalability of our services. But we could change that with caching.
The approach
We opted for an in-memory, short-lived read cache within the publisher component. When a request comes through, the instance’s webhook configuration is cached for 3 seconds. Within that period, any subsequent requests for that same instance are served from memory instead.
To demonstrate the impact of this change, let’s take an example. Say we have two very active customer instances. Instance A is doing 100 req/s, and instance B is doing 200 req/s. And we have two workers talking to DynamoDB.
Here is the situation before introducing the cache:
And here is the result after introducing the cache:
We go from 300 reads per second to about 1.4 reads per second, roughly 200x fewer database reads, dramatically reducing load. And with DynamoDB being an external service, we also cut out a lot of unnecessary network requests and trim our AWS bill.
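The arithmetic behind those numbers: without the cache, every webhook job triggers its own read, whereas with the cache each worker refreshes each instance at most once per 3-second TTL, no matter how busy that instance is.

Before the cache: 100 req/s + 200 req/s = 300 DynamoDB reads/s
After the cache:  2 workers × 2 instances × (1 read per 3 s) ≈ 1.4 DynamoDB reads/s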
But how did it actually fare in production? See for yourself:
No spikes, far fewer requests overall. We reduced the load on DynamoDB by approximately 7x (which compensates for the “on-demand” pricing). We get the best of both worlds 🌍.
The cache acts like a filter for high-volume instances (for those of you who like electronics, this basically means that it works like a decoupling capacitor). Although it doesn’t do anything for one-off requests, it shields us from harm when those large request waves come through, which is exactly when systems are more prone to failure.
We can expect to get more customers over time, which will mean more spikes. Being able to deal with this future-proofs the webhooks pipeline for the medium term.
The rationale
Isn’t caching bad? What about stale data? 🤔
Slightly stale data isn’t always the end of the world. The choice to cache is heavily dependent on the data you are handling. Caching would not be acceptable where data consistency is important, e.g. chatroom messages in Chatkit (because a user getting the latest messages is critical).
In our case here, the database holds customer webhook configurations, which customers update in the Pusher dashboard (shown below).
When a customer makes changes to their webhooks configuration, odds are it will be a few moments before any new webhook events are fired, so we can tolerate a short period of stale data. Indeed, we set the Time To Live (TTL) for the webhook cache to 3 seconds, which is nice and short.
Why an in-memory cache? Why not use something like Redis? 🧐
- No external dependencies, so less infrastructure to setup and maintain, reducing complexity.
- Minimal failure rate: an in-memory cache can’t go down the way an external cache service can, which gives greater resilience.
- No network or connection errors at runtime when reading from the cache.
- In-memory is orders of magnitude faster than any network request. Indeed, the time it would take for a request to reach Redis would be about the same as reaching DynamoDB. And both being key-value stores, their response times would also be similar.
- The webhooks pipeline publisher only has two workers, so each worker keeping its own private cache duplicates very little work, and we can get away without a shared cache.
Although Redis would maximise the theoretical cache hit rate, we can still have very good results without it. Here is a view of the cache’s hit rate since implementation.
The graph shows that after implementing the cache, its hit rate has been hovering between 70% and 80%. Not bad for an in-memory cache.
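For completeness, the hit rate is simply hits / (hits + misses). A minimal sketch of the kind of counters that could back that graph (illustrative only, not our production instrumentation):

package cache

import "sync/atomic"

// hits and misses back the hit-rate graph; illustrative names only.
var hits, misses uint64

// recordLookup is called after every ListWebhooks lookup.
func recordLookup(servedFromCache bool) {
    if servedFromCache {
        atomic.AddUint64(&hits, 1)
    } else {
        atomic.AddUint64(&misses, 1)
    }
}

// hitRate returns the fraction of lookups served from memory.
func hitRate() float64 {
    h := atomic.LoadUint64(&hits)
    m := atomic.LoadUint64(&misses)
    if h+m == 0 {
        return 0
    }
    return float64(h) / float64(h+m)
}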
The code
Here’s a simplified version of the implementation, written in Go (imports and most error handling are omitted). For the cache itself we use the karlseguin/ccache library. As an optimisation, we also take advantage of Go’s singleflight package (golang.org/x/sync/singleflight) to suppress duplicate database requests when there is a cache miss.
const defaultCacheDuration = 3 * time.Second
const maxCachedWebhookInstances = 10000

type Store interface {
    ListWebhooks(instanceID string) (*[]Webhook, error)
    // ...
}

type storeWithReadCache struct {
    Store // embedding makes the struct compliant with the `Store` interface

    cacheDuration                 time.Duration
    cacheListWebhooks             *ccache.Cache
    cacheListWebhooksSingleFlight *singleflight.Group
}

func NewStoreWithReadCache(underlyingStore Store) *storeWithReadCache {
    cache := ccache.New(ccache.Configure().MaxSize(maxCachedWebhookInstances))
    return &storeWithReadCache{
        Store:                         underlyingStore,
        cacheDuration:                 defaultCacheDuration,
        cacheListWebhooks:             cache,
        cacheListWebhooksSingleFlight: &singleflight.Group{},
    }
}

// We override the method responsible for the database read.
func (swrc *storeWithReadCache) ListWebhooks(instanceID string) (*[]Webhook, error) {
    // Check the cache first
    cachedItem := swrc.cacheListWebhooks.Get(instanceID)
    if cachedItem != nil && !cachedItem.Expired() {
        return cachedItem.Value().(*[]Webhook), nil
    }

    // On a cache miss, go to the database
    // (and suppress duplicate requests with singleflight)
    webhooksUntyped, err, _ := swrc.cacheListWebhooksSingleFlight.Do(instanceID, func() (interface{}, error) {
        return swrc.Store.ListWebhooks(instanceID)
    })
    if err != nil {
        return nil, err
    }

    // Cache the value before returning it
    webhooks := webhooksUntyped.(*[]Webhook)
    swrc.cacheListWebhooks.Set(instanceID, webhooks, swrc.cacheDuration)
    return webhooks, nil
}
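Because storeWithReadCache satisfies the same Store interface, dropping it in is just a matter of wrapping the existing store when the publisher is built. Something along these lines (the DynamoDB store constructor and client here are hypothetical):

// The underlying DynamoDB-backed store; constructor name is hypothetical.
dynamoStore := NewDynamoDBStore(dynamoClient)

// Wrap it with the read cache. The rest of the publisher is unchanged:
// it still calls store.ListWebhooks(instanceID), unaware a cache sits in front.
store := NewStoreWithReadCache(dynamoStore)

webhooksConfig, err := store.ListWebhooks(instanceID)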
Rounding up
As great as this all is, nothing is perfect and we should acknowledge the tradeoffs:
Data inconsistency vs greater speed & availability
As mentioned earlier, there’s a small risk of data inconsistency within the 3-second cache window — we lose strong consistency and allow for stale reads. But eventual consistency gives us higher availability (we proceed with stale data if DynamoDB is momentarily unavailable) and we gain greater speed when there’s a cache hit (1000 times faster).
Memory usage
There’s a slight increase in memory usage in the worker. We cap the number of instances we cache, and back-of-the-envelope calculations suggest memory usage could increase by up to 50MB.
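Those numbers hang together: with maxCachedWebhookInstances set to 10,000, a ~50MB ceiling works out to budgeting roughly 5KB per cached webhook configuration.

10,000 cached instances × ~5 KB per configuration ≈ 50 MB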
Cache invalidation can be hard
Caching is usually a tricky subject, with issues such as cache invalidation. However, the very short Time To Live of 3 seconds keeps things simple: we never invalidate entries explicitly, we just let them expire.
In summary, a straightforward cache goes a very long way 💥.