A couple of weeks ago I attended SREcon EMEA 2019 in Dublin. I felt like I learned a lot over the (exhausting!) 3 days, 19 talks, and 2 workshops, so I wanted to write up my experience.
Before we get into the conference itself, you may be wondering “what is Site Reliability Engineering anyway?”
Site Reliability Engineering is a discipline defined by Google that combines software engineering and operations. They wrote a popular book about it that you can read online. You might be wondering how the SRE role differs from “DevOps Engineer”, “Cloud Infrastructure Engineer” or “Sysadmin”, or how it relates to regular software engineering. I don’t think the definition is clear, and the people I spoke to at the conference weren’t clear either! There was some lively discussion about this. In his keynote talk The SRE I aspire to be, Yaniv Aknin said “software engineers come to work to build features, SREs come to work to measurably optimise cost/reliability”. Maybe that’s helpful. I don’t know. I don’t think it matters too much.
Here are some recurring themes I picked up on throughout the conference.
Has the field re-invented the wheel?
A core theme of the conference was, of course, reliability. SRE is a relatively young field, and in that short time it has defined a number of tools for quantifying reliability: SLIs, SLOs, and error budgets.
However it’s not the only field that is concerned with reliability. There are other, older, fields where reliability is a core concern. Think: aerospace, defence, nuclear, etc. These have also developed their own tools for quantifying reliability: STAMP, STPA, Fault Tree Analysis, IEC 61508, and many other frameworks.
There were a number of talks, particularly the keynotes which focused on the topic of reliability in these other fields. This made me think: while the early phase of SRE has been characterized by defining its tools from first principles, the next phase might involve an exploration of the tools used in other fields. How do the SRE tools compare, and what can we learn from other fields?
I found Fault Tree Analysis really interesting and potentially useful. It’s a tool for calculating the availability/data durability of a system based on the availability/durability of the constituent parts. Unfortunately it cannot be used to derive the latency of a system. I’d recommend watching the talk.
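To make the idea concrete, here is a minimal Python sketch of the kind of arithmetic Fault Tree Analysis builds on: availabilities multiply when every component must be up, and unavailabilities multiply across redundant replicas. The components and numbers below are invented for illustration, and real FTA involves much more than this.

```python
def series(*availabilities):
    """System is up only if every component is up (the system fails
    if any component fails), so availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(*availabilities):
    """System is up if any redundant replica is up (the system fails
    only if all replicas fail), so unavailabilities multiply."""
    failure = 1.0
    for a in availabilities:
        failure *= (1.0 - a)
    return 1.0 - failure

# Made-up example: a load balancer in front of three replicas,
# all depending on a database.
lb, replica, db = 0.999, 0.99, 0.9995
system = series(lb, parallel(replica, replica, replica), db)
```

Note the asymmetry: three 99% replicas in parallel give “six 9s”, so the overall availability here is dominated by the non-redundant load balancer and database.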
What’s next on the horizon for monitoring tools?
We now have a myriad of systems available for metrics aggregation, log aggregation, tracing, and alerting. What tools are coming next over the horizon?
- Automatic anomaly detection. Currently our systems spit out a vast number of metrics and logs. It’s up to engineers to figure out how to set up alerts based on these if anything unusual happens. This is tricky to get right. A lot of providers are focussing on automatically detecting anomalies in log/metric data out of the box. I noticed that almost all the “APM” SaaS companies (e.g. Datadog, Instana, StackState) in the sponsor area listed this prominently as one of their hot new features.
- Automatic correlation detection. A firing alert will only tell you the symptoms of an issue. These symptoms might be far removed from the root cause. The chances are if something is wrong, then a large number of metrics will be affected. You don’t necessarily want to see all firing alerts; you just want the alert that is closest to the root cause. Efforts are being made to perform this “metric correlation” automatically.
These features were often grouped together under the term “AIOps”.
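As a toy illustration of the correlation idea, here is a Python sketch that ranks metrics by how strongly they move with a symptom metric. All metric names and numbers are invented, and real AIOps products use far more sophisticated techniques than plain Pearson correlation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

symptom = [100, 102, 98, 250, 260, 255, 101, 99]   # e.g. p99 latency
candidates = {
    "db_connections": [20, 21, 20, 90, 95, 92, 21, 20],
    "cpu_user":       [30, 31, 29, 33, 30, 32, 31, 30],
    "queue_depth":    [5, 5, 6, 40, 42, 41, 6, 5],
}

# Metrics that move most strongly with the symptom are the best
# root-cause candidates to investigate first.
ranked = sorted(candidates,
                key=lambda m: abs(pearson(symptom, candidates[m])),
                reverse=True)
```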
9s don’t matter if users aren’t happy
Availability SLIs are normally defined as the percentage of “good” events divided by the total number of events. An SLO defines the target SLI value over a period of time. For example, an SLO against service uptime might measure the percentage of successful requests out of the total number of requests, over a month. An SLO is often >99%, e.g. 99.99%. That’s where the “9s” come from.
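As a concrete (invented) example, here is how that arithmetic looks in Python, including the error budget implied by the SLO target:

```python
# All numbers are made up for illustration.
successful_requests = 9_992_000
total_requests = 10_000_000

sli = successful_requests / total_requests   # fraction of "good" events
slo_target = 0.9995                          # "three and a half 9s"

# The error budget is the failure fraction the SLO allows.
error_budget = 1.0 - slo_target
budget_spent = (1.0 - sli) / error_budget    # >1.0 means budget blown

slo_met = sli >= slo_target
```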
As Yaniv Aknin described in his talk The SRE I aspire to be, defining SLOs that align with business and user needs is tricky, and so is picking the right definitions. For example, what counts as a “failed” request? For latency SLOs, which percentile should you target? SLOs are often set at the 99th percentile or higher, but users tend to care more about the median latency. Narayan Desai in his talk The Map Is Not the Territory: How SLOs Lead Us Astray, and What We Can Do about It had some interesting suggestions here.
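To illustrate why the percentile you pick matters, here is a toy Python example with an invented latency distribution: a fast bulk of requests plus a small slow tail. The median barely notices the tail that dominates the 99th percentile.

```python
import random

random.seed(0)
# Invented distribution: 98% fast requests, 2% slow tail.
latencies_ms = ([random.gauss(50, 5) for _ in range(980)] +
                [random.gauss(900, 100) for _ in range(20)])

def percentile(samples, p):
    """Nearest-rank percentile, p in [0, 100]."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

p50 = percentile(latencies_ms, 50)   # what a typical user experiences
p99 = percentile(latencies_ms, 99)   # what a tail-latency SLO measures
```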
Because of this, SLOs are not a substitute for customer feedback. You might have 99.9999% uptime and your users are still not happy, or you might have 97% uptime and that is good enough. Users’ reliability expectations and requirements change over time. You need to repeatedly take users’ feedback on board and tweak your SLOs based on it.
John Looney hammered this point home in his talk A Customer Service Approach to SRE. He argued that ops people are really customer service people, and the same principles apply. Good customer service is often a better investment than improving reliability. Good customer service wins forgiveness: if the service has issues, users will have confidence that you will solve them. They might even help you and give you useful feedback.
Napkin math for capacity planning/SLO estimation
SRE work often involves answering questions like:
- Will this system be able to handle 100k queries per second?
- What is the probability that this system will drop a message?
These are tricky questions to answer because the systems are complex. Each component has its own capacity and reliability characteristics that need to be combined in order to answer questions of the system as a whole.
To answer these questions accurately, you have to build the system and load test it empirically. This is incredibly expensive. You generally want to start by going much wider with the options you consider when designing a system. To do this you need to evaluate options cheaply. A number of talks I attended advocated starting with a rough estimation that should get you within an order of magnitude of the correct value, aka napkin math.
Simon Eskildsen’s talk Advanced Napkin Math: Estimating System Performance from First Principles is worth watching if you want to see an example of combining base-rates to estimate system performance.
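In that spirit, here is a tiny napkin-math sketch in Python. Every base rate and parameter below is an order-of-magnitude assumption I have made up for illustration, not a benchmark:

```python
# Rough base rates (order-of-magnitude assumptions, not benchmarks).
SSD_SEQ_READ_BPS = 1 * 10**9   # ~1 GB/s sequential SSD read
RECORD_SIZE_B = 1_000          # ~1 KB per record

# One disk can scan roughly this many records per second.
records_per_sec_per_disk = SSD_SEQ_READ_BPS / RECORD_SIZE_B  # ~1M/s

# Made-up workload: each query scans 50 records.
target_qps = 100_000
records_scanned_per_query = 50

required_records_per_sec = target_qps * records_scanned_per_query
disks_needed = required_records_per_sec / records_per_sec_per_disk
```

The answer is only good to within an order of magnitude, but that is usually enough to rule a design in or out before spending weeks building it.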
The same principles also featured heavily in a workshop on Non-Abstract Large System Design. The workshops weren’t recorded, but the NALSD chapter in the SRE book covers the same content.
Tool proliferation

A common theme in “experience report” talks I went to was the problem of tool proliferation: the large number of components that make up monitoring systems these days. It’s operationally complex managing all of these tools, but it’s particularly unpleasant from a user experience perspective.
This resonated with me because we suffer from this problem at Pusher. Off the top of my head, we make use of Nagios, Librato, Heroku monitoring, New Relic, Papertrail, Sentry, Cloudwatch, Prometheus, Kibana, Grafana,…
Molly Struve in her talk Building a Scalable Monitoring System described precisely this problem. She mentioned how her team wrapped all of these tools with Datadog. While this meant adding yet another tool to the stack, at least the data could be accessed through a single uniform UI.
Unfortunately there doesn’t seem to be a single system that solves all of these problems well, so for now I don’t think tool proliferation is a fully solved problem.
Other talks worth watching

- Latency SLOs Done Right by Heinrich Hartmann – I also went to his workshop Statistics for Engineers. Basic statistics comes up frequently in the SRE field. My statistics knowledge is rusty so I really enjoyed attending this workshop! Unfortunately the workshop wasn’t recorded, but the slides are online and the tutorials are on GitHub.
- Pushing through Friction by Dan Na – A well-delivered talk on tips for pushing through organizational friction. This twitter thread sums up the content better than I could.
- How to SRE When Everything’s Already on Fire by Alex Hidalgo and Alex Lee – Another well-delivered talk with lots of suggestions on how you can incrementally bring in long term investments while fire fighting.
- Fault Tree Analysis Applied to Apache Kafka by Andrey Falko – I already mentioned this. It seems like an interesting tool. I’d like to try it out.
Random thoughts and observations
- The impression I got from most speakers was that microservice architectures are always better than monoliths. I think this is more controversial outside of the SRE bubble. I think this opinion could stem from the fact that microservice architectures might make more sense for companies that are large enough to hire dedicated SRE teams.
- The concept of do-nothing scripts was mentioned a few times, though I don’t think people know what to call them. This is something we’ve been adopting in the Channels team at Pusher. Naoman Abbas in his talk Evolution of Observability Tools at Pinterest suggested writing these do-nothing scripts for managing on-call incidents and linking to them from monitoring dashboards. This seems like a good idea.
- Something I think we’ve struggled with is how to prioritize infrastructure investments vs customer-facing work. I also don’t think we’ve figured out how supporting teams with internal customers (e.g. teams building shared infrastructure) should prioritize work in a systematic way. In his talk How Stripe Invests in Technical Infrastructure, Will Larson gave a number of suggestions on this topic.
Tools worth looking into
There were a bunch of monitoring/alerting/APM SaaS products on show at the sponsor booths and also mentioned in talks. I got demos of a few of these. I was most impressed by https://www.instana.com/. https://www.datadoghq.com/ also seems to be shipping more and more useful features. I’d like to try it out as a New Relic alternative. I was interested to see that https://www.humio.com/ stores log data without using an index. They argue that this allows them to ingest logs at a very high throughput (makes sense). They argue that queries are still fast even with GBs of log data. It seemed fast in a demo. Interesting…
Word of the conference
toil (noun [U])
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
Classic examples are: manually scaling clusters; manually replacing boxes; processing server logs.
The first step to eliminating toil is to measure it. Otherwise how can you prioritize investments that help eliminate it? Unfortunately there wasn’t much advice given here. We have tried keeping a maintenance log in the Channels team at Pusher, but this takes a huge amount of discipline to keep up. Any ideas?
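For what it’s worth, even a plain-text log that a script can total up would be a start. Here is a hypothetical sketch; the log format and task names are invented:

```python
from collections import defaultdict

# Hypothetical maintenance log: date, task, minutes spent.
LOG = """\
2019-10-07,scale-cluster,30
2019-10-08,replace-box,45
2019-10-09,scale-cluster,20
"""

# Total up the minutes of toil per task, so the most expensive
# categories become the obvious automation candidates.
minutes_by_task = defaultdict(int)
for line in LOG.strip().splitlines():
    _date, task, minutes = line.split(",")
    minutes_by_task[task] += int(minutes)
```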
I’d recommend this conference. I learned a lot, and it’s very relevant to the kind of work we do at Pusher. I like the fact that it is technology-agnostic.
It was the first conference where I spent a significant amount of my time at workshops as well as talks. I think I learned more this way. I’d do that again next time.
I left convinced about the importance of defining SLOs - particularly for a SaaS company like Pusher, where our customers often depend on our APIs for business critical parts of their applications.
I think that reliability is our most important feature. However, without reliability metrics and targets it is very hard to prioritize investments into this area. SLOs are a great tool for solving this problem in a way that aligns with user needs. They are also a concise and powerful tool for communicating relative reliability requirements of each feature to the rest of the business.