Friday, December 10, 2010

Drizzle Transaction Log and Statement Rollbacks

First off, apologies to those who left comments on my past posts and your comments didn't show up. Apparently, I didn't have my blog settings setup correctly, and I just noticed (and approved) them.

So for the past few weeks, Patrick Crews and myself have been trying to chase down a really nasty transaction log bug that has been showing up under heavy load. The transaction log would end up not accurately representing the changes that took place in the server. Obviously, that's bad... unless you don't care if your slaves are an accurate representation of your master.

This has been a crazy difficult bug to track down, partly due to the fact that the cause appears to be multiple bugs. We've identified at least two bugs that are leading to transaction log inconsistency, one of which is how we deal with SQL statements that fail mid-execution. (The other, we believe, has to do with savepoints, and we're still working on that one.)

For the statement rollback bug, there were two problems:
  1. We were not correctly removing recorded row changes when a statement affecting multiple rows failed after changing the first row.
  2. Because we can now segment a single statement's row changes across multiple Transaction messages (due to having a max message size), we had no way to rollback transaction log changes made by a failed statement once it had been split up into multiple messages.
For issue #1, consider this statement:
INSERT INTO t1 (not_null_column) VALUES ('abc'),('def'),(NULL);
As each row is inserted, the in-memory protobuf message recording the transaction row changes is updated to include the row inserts. As we insert into this column which does not allow NULLs, we'll get changes recorded for values 'abc' and 'def'. However, once we try to insert NULL, the statement fails and InnoDB rolls back the two inserted rows. These two rows were not being removed from the in-memory message structure, which later gets sent through the replication stream and written to the transaction log, thus the inconsistency. The fix is fairly straightforward in that we just recognize the error, then remove the entries before sending the protobuf message downstream.

For issue #2, the situation is similar to issue #1, except that once row changes were sent through the replication stream (i.e., recorded in the transaction log file), we had no way to undo them if the current statement was still active and ended up failing. We decided to resolve this by adding a new protobuf Statement message type: ROLLBACK_STATEMENT.

The implications of this new Statement type? The first is that we now need to be able to clearly identify statement boundaries so we know how much work to undo. That means we no longer "roll up" like SQL statements into the same protobuf Statement message as an optimization for space (this avoided repeating the header information for the Statement message). So now each row-changing SQL statement executed gets its very own protobuf Statement message.

The second implication is that readers of the transaction log file need to be able to either:
  • rollback statements (you could conceivably use savepoints for this), or
  • cache Statement messages until the end segment is seen (which it can then execute the changes) or discard the cached messages if the next read Statement is a ROLLBACK_STATEMENT type.
These changes are now in our trunk. And major props to Stewart who helped us identify and fix this bug using his storage_engine_api_tester plugin.