Recently one of our PMs flagged to us that our product analytics data was nearly 9 days behind the production database! We quickly found that this was due to MySQL replication lag. But to fix this lag, we had to dive deep into the internals of MySQL replication! In this post, I’ll show you those internals. Along the way, I’ll point out possible causes of replication lag, such as single-threaded replication, choice of binlog format, and disk flushing policy. I’ll show you how to diagnose the problem in your MySQL setup. Finally, I’ll describe how we fixed the replication lag in our case.
At Pusher, we have a product analytics system which looks like this:
As soon as we noticed that our product analytics were stale, we checked our CloudWatch metrics, and noticed that the “lag” metric for the MySQL replica was increasing indefinitely:
We initially guessed that the replica was 9 days out-of-date because replication had stopped 9 days ago. But that wasn’t the case: the lag had been increasing for 14 days, and replication was still running. The problem was that the replication could not keep up with the changes on the master.
But why? How can we investigate it? How can we fix it? Before we can start to answer those questions, we need to understand the key concepts that make up the internals of MySQL replication1. And before we go into replication, we need to understand the MySQL binlog.
The binlog
MySQL uses a binlog to implement point-in-time recovery. Let’s say you take periodic backups of your database at points in time. RDS snapshots are an example of this. If the last backup was taken at 12:00, and the database was accidentally deleted at 12:27, how do you restore the updates that happened between 12:00 and 12:27?
This is where the binlog2 comes in. The binlog is a series of files in the data directory containing all committed updates to the database in chronological order. When a MySQL thread is processing a transaction, it will buffer all write operations until it receives a COMMIT, then it will write the updates to the binlog before applying them to the database itself.
When performing a point-in-time restore, the database is first loaded from a snapshot, then all updates in the binlog are applied up to the time that the database state was deleted.
As the name implies, the binlog is in a binary format, but you can view it as ASCII SQL statements with mysqlbinlog. Here’s an example of running it on a binlog on my machine with mysqlbinlog /var/lib/mysql/mysql-bin.000001:
# at 1476
#200309 17:06:29 server id 1 end_log_pos 1583 CRC32 0xc20ce1f9 Query thread_id=1 exec_time=0 error_code=0
SET TIMESTAMP=1583773589/*!*/;
BEGIN
/*!*/;
# at 1583
# at 1622
#200309 17:06:29 server id 1 end_log_pos 1622 CRC32 0x068444cd Rand
SET @@RAND_SEED1=313260617, @@RAND_SEED2=516001561/*!*/;
#200309 17:06:29 server id 1 end_log_pos 1785 CRC32 0xb45a622d Query thread_id=1 exec_time=0 error_code=0
SET TIMESTAMP=1583773589/*!*/;
DELETE FROM test WHERE id BETWEEN 1 AND ROUND((RAND() * 4)+1)
/*!*/;
# at 1785
#200309 17:06:29 server id 1 end_log_pos 1816 CRC32 0x9c071f47 Xid = 53
COMMIT/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
This is valid SQL that you could run by piping it to the mysql client. You can even read the binlog from a running MySQL instance with the --read-from-remote-server option.
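If you would rather stay inside the mysql client, SHOW BINLOG EVENTS gives a similar (if less detailed) view. A quick sketch, assuming the binlog file name is whatever SHOW BINARY LOGS reports on your server:

-- List the binlog files the server currently has
SHOW BINARY LOGS;
-- Peek at the first few events in one of them (file name taken from the output above)
SHOW BINLOG EVENTS IN 'mysql-bin.000001' LIMIT 10;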
We won’t go into detail about all the metadata in it, but instead focus on the three main formats that binlogs can be written in. These will be relevant to our replication lag investigation.
STATEMENT format
The simplest format is STATEMENT: the binlog contains the exact updates that are applied to the database. If you look closely at the above binlog output, you will see the precise statement I ran that deletes some random keys:
DELETE FROM test WHERE id BETWEEN 1 AND ROUND((RAND() * 4)+1);
So why would we need any other format? The disadvantage of STATEMENT format is that it’s non-deterministic3. A statement can apply different values depending on where or when it is run. For example, it might depend on the current DATE().
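To make that concrete, here’s a hypothetical statement (not from our schema) that is unsafe to replicate as a statement, because UUID() returns a different value every time it is evaluated:

-- With STATEMENT format, a replica would re-run UUID() and store a
-- different token than the master did.
INSERT INTO api_keys (user_id, token) VALUES (42, UUID());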
If you want the binlog to apply exactly the same values that were applied to the database, you can set the binlog to ROW format. ROW format translates transactions into a series of statements that only update a single row with concrete values.
ROW format
Now if we run mysqlbinlog on a binlog containing the above query, instead of the DELETE statement we get this mysterious statement:
BINLOG '
aXpmXhMBAAAAPQAAAL8CAAAAAEcAAAAAAAEAEnB1c2hlcl9kZXZlbG9wbWVudAAEdGVzdAABAwAA
mTdwwA==
aXpmXiABAAAALQAAAOwCAAAAAEcAAAAAAAEAAgAB//4CAAAA/gUAAADVbbXb
';
To base64 decode this we can run mysqlbinlog with the options --base64-output=decode-rows --verbose, which gives us the output:
### DELETE FROM `pusher_development`.`test`
### WHERE
### @1=2
### DELETE FROM `pusher_development`.`test`
### WHERE
### @1=5
This is a textual representation of the exact modifications it will make.
The disadvantage of ROW binlog format is that the binlogs become much larger. For example, a single query to update one million records would result in one million entries in the binlog.
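As a sketch of that blow-up (the table here is hypothetical), consider a bulk update:

-- In STATEMENT format this is a single binlog entry.
-- In ROW format it produces one row event per matching row,
-- potentially millions of entries for one statement.
UPDATE events SET processed = 1 WHERE processed = 0;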
MIXED format
MySQL has a third format, MIXED, which helps solve this problem. It uses STATEMENT format by default, but switches to ROW for non-deterministic updates.
RDS MySQL logs in MIXED format by default. It can be configured by setting the binlog-format variable.
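To see which format a server is using, you can query the variable from the mysql client. Changing it at runtime requires the SUPER privilege, so on RDS it is normally changed through the DB parameter group rather than with SET GLOBAL:

-- Check the current logging format
SHOW VARIABLES LIKE 'binlog_format';
-- On a self-managed server you could switch it at runtime;
-- on RDS, change the binlog_format parameter in the parameter group instead.
SET GLOBAL binlog_format = 'ROW';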
MySQL replication
Why did we just spend all this time talking about point-in-time recovery and the binlog? It’s because replication works in a very similar way.
Typically when you attach a replica to an existing database, you will first restore the state from some point-in-time snapshot. To “catch up” with the state on the master, the replica will connect to the master and start applying statements from the binlog. The replica will continue to apply these updates indefinitely.
Before discussing the process in more detail, let’s extend our diagram with some new components.
When a replica connects to the master, the master will start up a “binlog dump” thread (one per replica), which reads from the binlog and asynchronously sends updates to the replica. The replica starts up two threads. The first is the I/O thread, which reads updates from the binlog dump thread on the master and writes them to a relay log; the relay log is a series of files in the same format as the regular binlog. The second is the SQL thread, which reads the relay log and applies the updates to the database.
The I/O thread keeps track of its current position in the master’s binlog by storing the current file and offset in “status log” files. That way the replica can resume replication if it disconnects or if the process is restarted.
Note that replication is single threaded. There is only a single instance of the I/O thread and the SQL thread. The updates are not applied in parallel like they might be on the master.4 We suspected early on that this was our problem, and looked into fixing it with parallel SQL threads. However, this wasn’t our eventual fix. First, this feature only parallelizes across different databases, but we only use a single database. Second, we found a different cause of lag that we were able to fix: read on!
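For reference, this is roughly what we looked at: MySQL 5.6’s slave_parallel_workers setting, which runs multiple applier threads but only parallelizes across different schemas. A sketch, assuming a self-managed replica where you can stop and start the SQL thread:

-- 0 means the traditional single SQL thread
SHOW VARIABLES LIKE 'slave_parallel_workers';
-- Enable 4 applier threads (per-database parallelism only in 5.6,
-- so this would not have helped our single-database workload)
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_workers = 4;
START SLAVE SQL_THREAD;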
Isolating which thread is struggling to keep up
Now we understand enough concepts to start investigating replication lag.
The first thing to do is figure out which of these threads is struggling to keep up. You can do this by working out which binlog file, and position within that file, each thread is currently processing. In a system with no lag, the binlog files and positions should be close to identical.
There are some helpful commands that tell us where in the binlog each thread is up to5.
On the master, run:
mysql> SHOW MASTER STATUS\G
*************************** 1. row ***************************
File: mysql-bin-changelog.076928
Position: 27998964
1 row in set (0.08 sec)
This shows the current binlog file the master is writing to, and the current position within that file.
On the replica, run:
mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Log_File: mysql-bin-changelog.076928
Read_Master_Log_Pos: 32312254
Relay_Log_File: relaylog.053851
Relay_Log_Pos: 14281377
Relay_Master_Log_File: mysql-bin-changelog.076928
1 row in set (0.08 sec)
Master_Log_File and Read_Master_Log_Pos show the file and position in the master’s binlog that the I/O thread has read up to. If this is behind the master’s current position, then the I/O thread is struggling to keep up.
Relay_Log_File and Relay_Log_Pos show the current position in the relay logs that the SQL thread is reading from. This is less interesting because it doesn’t relate to the position in the binlog. More interesting is Relay_Master_Log_File, which shows the binlog file containing the statement that the SQL thread last executed. If this is behind, it means the SQL thread is struggling to keep up.
Putting it all together to isolate the lag
In all cases it is worth checking there is no obvious resource saturation (CPU, memory) on the master and replica. The replica should also be a machine with the same resources as the master.
If the I/O thread is slow it likely means the network connection between the master and replica is unreliable or saturated.
In our case it was the SQL thread that was struggling. Unfortunately there can be a number of causes of this, and it can be tricky to isolate. This blog post has some good recommendations of things to look into.
During our investigation we noticed that the lag started soon after attaching the third-party MySQL instance (Fivetran) as a replica of our replica (you can daisy-chain replication like this). However, it wasn’t immediately obvious what specifically about this would cause our master-to-replica replication to run more slowly.
By default replicas don’t write to their own binlog6. However, by attaching another replica to it, the intermediate replica starts writing to a binlog.
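You can check whether a replica is writing its own binlog from the mysql client:

-- Is binary logging enabled on this replica at all?
SHOW VARIABLES LIKE 'log_bin';
-- Does the replica also log the updates it receives from its master
-- (required when another replica is chained off it)?
SHOW VARIABLES LIKE 'log_slave_updates';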
However, writing to a binlog didn’t explain why it couldn’t keep up - after all, the master was writing to a binlog and keeping up with writes on the same hardware.
After some investigation it turned out that the Fivetran MySQL instance required us to send the replication events in ROW format, while the master was using MIXED format. Because the replica had to log each change as a separate row event, the SQL thread on the replica was performing significantly more disk writes than the master. We use AWS EBS as the disk, so these additional writes were particularly expensive as they are made over a LAN.
We were still under our provisioned EBS IOPS, but since all the writes were coming from a single threaded process, we couldn’t effectively make use of any parallelism in the OS or hardware.
So how did we solve the problem?
We realized that by default, RDS MySQL flushes the binlog updates to disk on each write. This makes a lot of sense for point-in-time recovery: if the database crashes before the binlog entry is written to disk, then that write will not make it into a new database if a point-in-time restore is performed.
However for our replicas we don’t need synchronous writes to disk. If we want to restore a replica, we will do that from a master snapshot. Luckily the MySQL sync_binlog option lets you turn off flushing on each write, and instead leaves it to the OS.
You can set this in the replica’s MySQL config file (e.g. /etc/mysql/my.cnf) with
[mysqld]
sync_binlog=0
# ...
Or dynamically by running the query
SET GLOBAL sync_binlog=0;
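To confirm the setting took effect and watch the replica catch up, you can check the variable and keep an eye on Seconds_Behind_Master in SHOW SLAVE STATUS (roughly what the CloudWatch lag metric reflects for RDS MySQL):

-- Confirm flushing is now left to the OS
SHOW VARIABLES LIKE 'sync_binlog';
-- Seconds_Behind_Master should start falling as the SQL thread catches up
SHOW SLAVE STATUS\G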
This is what we did, and our database rapidly crunched through all of the lag it had accumulated!
Thanks to Jim Fisher for reviewing this and helping with the introduction.
Footnotes
1. We were running MySQL 5.6 on RDS with standard MySQL replication (not GTID-based or synchronous replication with NDB Cluster).
2. Note: the binlog is distinct from the write-ahead log or “redo log” that InnoDB uses.
3. The RAND() function is actually made deterministic by setting a random number generator seed in the binlog itself. You can see this in the example binlog output I shared: SET @@RAND_SEED1=313260617, @@RAND_SEED2=516001561/*!*/;. More information in the MySQL docs.
4. This is due to the complexity of implementing concurrent writes such that the data on the replica is consistent with the master: how would you ensure the transactions are applied in the same order? This blog post discusses this difficulty in more detail.
5. I omitted some of the irrelevant lines in the output of these commands.
6. You can manually configure replicas to write to a binlog with log_slave_updates.