Here at Pusher we’re always trying new approaches to web development and engineering to uncover ways of making our software more reliable. That’s why I’ve recently been using QuickCheck to fuzz test our Raft implementation with some great results.
In this blog post, I want to demonstrate how this is an an effective technique for testing distributed/concurrent systems, what makes it challenging and how you can do it yourself.
Introduction to QuickCheck
QuickCheck is a property based testing framework that lets you define invariants, or properties, about your code. When run, it generates lots of random input data, which it runs your code against and checks that none of the properties are violated.
There are many QuickCheck tutorials on the web, but they often only
focus on very contrived examples, e.g.
list == reverse (reverse list).
This is an easy-to-grasp demonstration of what it can be used for, but
we’ve found it quite uncommon that you would need to test something like
this in practice (testing encoders/decoders is the exception to this).
It is a shame that these kind of examples are the most common. I would argue that QuickCheck is actually more useful for testing messy real-world systems because there tends to be large numbers of potential edge/failure cases.
We are in the process of developing an implementation of the Raft consensus algorithm as part of a larger system written in Haskell. For more information on Raft in general, I recommend you follow the linked resources on the official website.
Distributed systems are notoriously hard to test. Because nodes in a cluster will process instructions concurrently, there are factorially many permutations of orders in which these can be executed in. Each one of these permutations can have its own unique bugs and edge cases. Testing all of these manually is impossible.
Short of formally proving correctness (which the authors of the Raft paper did do), the next best thing is performing fuzz testing. Rather than testing with constant input, fuzz tests randomly generate input each time they are run. The main advantage of this is that the large amount of input data is likely to cover edge cases that the programmer did not consider.
Generating random input data like this is central to how QuickCheck works. A significant advantage of QuickCheck and Haskell over other fuzz testing techniques is that tests are completely reproducible; the caveat is that this is only true if the code you are testing does not have side effects on parts of the system that is not being tested.
QuickCheck was a nice fit for testing Raft because the Raft paper already lists five properties (See figure 3 from the paper) that a correct Raft implementation cannot violate.
These properties could be directly converted into QuickCheck properties, which is great because one of the trickiest things about using QuickCheck is coming up with the properties in the first place.
How we tested
The idea was to simulate running a Raft cluster against arbitrary input commands, while checking the Raft properties. At a high level, the steps involved are:
- Start multiple instances of Raft in an initial state
- Generate a list of arbitrary Raft commands (
- Iteratively feed each command to one of the Raft instances
- Run the instance on the command
- Check no properties were violated
Implementing this turned out to be challenging for a number of reasons.
A deterministic scheduler
Raft nodes are meant to be run concurrently, but we need to run them sequentially if we want reproducible results. In order to solve this, we generated a “schedule” up front as a QuickCheck arbitrary value. This schedule was essentially just a repeating list of node IDs in a random order. At each iteration in the simulation, the next node ID will be taken from the head of this, and this would be the node that is run.
In order to keep things deterministic, all other IO had to be mocked. The IO required by Raft is network communication between nodes, and also timeouts within nodes.
To mock the network, we wanted to make it as realistic as possible, so rather than writing messages straight to the destination node, the mock connection had to deliver messages non-instantaneously and should also randomly disconnect. This was achieved in a similar way to the schedule: the boolean decision of whether to deliver a message or cause a network disconnection/reconnection could be generated as arbitrary QuickCheck values.
Note: our Raft implementation assumes TCP is the transport, so we did not need to mock dropping of messages or reorderings because TCP guarantees that this will not happen.
Raft also has the concepts of timeouts. To make these deterministic, we had to deliver the timeouts back to the shard to be handled after handling an arbitrary number of other commands.
The algorithm is also stateful, and the properties make reference to this state.
There are two approaches we considered for handling this:
- Run simulations, and save each state to a list. Check the properties afterwards.
- Run simulations using monadic QuickCheck. This allows assertions to be interleaved with other effects and also allows arbitrary values to be picked at any point.
In the end we decided on the second option because it seemed simpler to
check the properties during execution. However, It would be interesting
to try out the other technique, because it would be a nice separation of
concerns, and would also mean that the properties could be defined as
Propertys, rather than in the
In the end we managed to find 14 different bugs in our implementation.
Here are two examples of these bugs, and the workflow required to fix them.
Some bugs lead to violations of properties, and others simply lead to the program crashing (because it entered an invalid state).
Property violation example
Raft is based around keeping a log of state machine updates in sync across nodes.
When running the tests, QuickCheck alerted to us that the Leader Completeness property had been violated. This property states that “if a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms.”
Looking at the debug logging, it appeared that after an election, the new leader may not have the last entry in its replication log. After some investigation it appeared that when a disconnection from the leader was handled in the follower state, the term was not being incremented. This meant that the follower may be elected even if it contained an incomplete log. Bumping the term in this scenario fixed this issue.
Invalid state example
Other bugs lead to an error being thrown when a certain situation should not arise. An example bug was identified when Raft complained that a new log ID it was required to add to its replication log was not contiguous with the existing entries.
After looking into our code, we found this was being checked with by the following conditional:
if currentTerm >= term entry && currentLogID == logID entry + 1 then -- Add the entry to the log
Reading the Raft paper more closely, it became clear that this was too restrictive, and log entries may not be contiguous if the term is greater. The correct check was:
let -- If it's in a higher term, then the ID can already exist contiguousInHigherTerm = currentTerm > term entry && currentLogID <= logID entry + 1 -- In the same term it must occupy the next logID contiguousInSameTerm = currentTerm == term entry && currentLogID == logID entry + 1 if contiguousInHigherTerm || contiguousInSameTerm then -- Add the entry to the log
The problem with these bugs is they are easy to overlook when writing the code (no Haskell compiler errors here!). They are only likely to crop up with the large numbers of disconnections/delays/timeouts that are generated when running these QuickCheck tests.
QuickCheck lets you configure the number of tests to run, each against different input data. For something simple, 100 or so runs should be sufficient. For a distributed system with a huge number of combinations of schedules/type of input commands/network disconnections/network delays/timeouts/schedules, a particular bug may only surface in a small subset of these. For example some of the bugs we found only occurred once in ~500 tests. It is quite possible that if we ran it for even longer we may find more bugs. The problem is it can take quite long to run, so a good solution would be to run 10k tests daily, and 100 per commit.
The other big consideration is the distribution of the various input data. This is the ratio of publishes of new messages to disconnections to timeouts to cluster membership changes etc. The challenge is in striking the correct balance: too many inputs that will only test the happy cases (e.g. a lot of publishes or adding of new nodes to the cluster) will mean that it is very unlikely that edge cases will be uncovered. On the other hand if there is too much “destabilising input” (e.g. constant disconnections, or removal of nodes) then the cluster will never be in a state where it can communicate between nodes/win elections and therefore will not be in a useful state to actually add any new entries to the replication log.
It’s tricky to predict the correct balance up front, so we found that trying out some educated guesses, running some simulations and looking at the logs to see what actually occurred was a good way of fine tuning these parameters.
This technique does not cover all types of bug. For example we have found one bug subsequently which would never be caught by our QuickCheck tests. Essentially the bug caused the cluster to enter a state where there was no leader, and a new one could never be elected. This meant that no new entries could be added, so no properties would be violated, but it was definitely not operating as we’d like it to!
As you may have noticed from the examples bugs, while QuickCheck is good at finding errors, it is often not easy to track down their root cause. It would often involve a lot of time trawling through logs and checking through the course of events until something occurred which didn’t match up with the Raft paper.
We found QuickCheck an invaluable tool in testing the correctness of our implementation. Distributed systems are extremely hard to test manually – there are simply too many potential failure cases, and they are also very easy to overlook. The fact that we found a significant number of bugs, even when we derived our implementation from the description in the Raft paper demonstrates how hard it is to get these systems correct.
Having said that, it still required a lot of work to set everything up. In particular setting up the mocks, and generating the test data took thought and time to implement. We found it helped if we built up the complexity gradually: start by making assumptions about the mocks/input data e.g.
- mocked network connections deliver messages instantly
- the input data could be generated uniformly
- and only a subset of Raft commands would be tested
More complexity can be iteratively added from there.
The flip side of the challenges we faced while writing these tests is that it forced us to reconsider the assumptions and potentially unhandled cases that we may have missed.
While this blog post focused specifically on Raft, the techniques and considerations should be applicable to other distributed systems or systems with concurrent operations more generally.
We’re still learning in this area, so as ever we are eager to hear about any suggestions or improvements over our current process. Leave us a comment below or find me over on Twitter at @willsewell_.