[Box Backup] File diff inverts block probability?
BackupStoreFileDiff.cpp:515
Alex Harper
boxbackup@fluffy.co.uk
Wed, 27 Feb 2008 16:17:13 -0800
Hi Chris,
> I'm afraid I can't either refute or support your assertion that changing
> the loop order results in faster diffs or transfers at this time. I'm not
> against doing it, although I'd like to see the new code tested heavily
> before going into a release.
Of course, and to be clear, I'm not proposing this be included in 0.11. It's
too fundamental to the reliability of the system. At this point all I'm
trying to decide is whether I've totally misunderstood the code. It seems I'm
on the right track.
However, it does bring up an interesting question. Let's say I had a patch to
this code (like this loop change). You can assume it passes the unit tests;
however, I already know that Box's unit tests, while very impressive, don't
cover all the corner cases. I'm also just selfish enough with my free time
not to want to take on the job of making the file diff unit tests completely
comprehensive :)
What's a reasonable expectation for testing the patch? Do individual
well-known active developer branches (like yours) have enough volunteer users
that good coverage happens?
> We currently re-read the file once for each block size that we intend to
> try. If we started by reading the most commonly-used block sizes instead
> of the least, as you propose, then potentially we could find all our
> matches in most commonly-used block sizes and reduce the number of times
> that we have to re-read the file (i.e. minimise local disk I/O).
>
> If I understand you correctly, the change you propose has a negligible or
> uncertain impact on network bandwidth use, and does reduce disk I/O, so I
> can see that as an argument in favour of your suggestion.
Not quite. I was targeting a smaller fix at first, just a reversal of the
loop. In the current implementation, reading in block probability order (not
always largest to smallest) doesn't actually change the I/O profile: it still
reads every byte at every block size. What it changes is that as block-size
probability goes down, you spend more time in
RollingChecksum::RollForwardSeveral() rolls (past all the previously found
big and high-probability blocks). Since rolling the checksum is less
expensive than the remaining table lookups and MD5 calculations, the total
wall time is reduced.
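To make the cost argument concrete, here's a generic rsync-style weak
checksum sketch. This is purely illustrative (the names and layout are mine,
not Box's actual RollingChecksum class): the point is just that sliding the
window forward by one byte costs a couple of additions, while each candidate
match still pays for a table lookup and an MD5 over the whole block.

    #include <cstddef>
    #include <cstdint>

    // Illustrative rsync-style weak checksum, NOT Box's RollingChecksum.
    class WeakChecksum
    {
    public:
        // Compute over an initial window of blockSize bytes: O(blockSize).
        WeakChecksum(const uint8_t *data, size_t blockSize)
            : a(0), b(0), size(blockSize)
        {
            for (size_t i = 0; i < blockSize; ++i)
            {
                a += data[i];
                b += static_cast<uint32_t>(blockSize - i) * data[i];
            }
        }

        // Slide the window forward one byte: O(1), no re-read of the block.
        void RollForward(uint8_t leaving, uint8_t entering)
        {
            a += entering - leaving;
            b += a - static_cast<uint32_t>(size) * leaving;
        }

        uint32_t Value() const { return (b << 16) | (a & 0xffff); }

    private:
        uint32_t a, b;
        size_t size;
    };

So the reordered loop just trades more of those cheap per-byte rolls for
fewer of the expensive lookups and strong checksums.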
There is another missed optimization, where the goodnessOfFit check rolls you
to a new offset that itself already has a better goodnessOfFit (the check
should loop), but that's a larger change, and I chose to roll it into a
broader fix.
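To show the shape of the looping change I mean (hypothetical names only;
goodness and nextCandidate here just stand in for the real goodnessOfFit
logic, which is more involved than this):

    #include <cstdint>
    #include <functional>

    // Instead of jumping once to the offset the fit check suggests, keep
    // re-running the check at the new offset while it keeps improving.
    uint64_t SettleOnBestOffset(uint64_t start,
        const std::function<int(uint64_t)> &goodness,
        const std::function<uint64_t(uint64_t)> &nextCandidate)
    {
        uint64_t offset = start;
        int best = goodness(offset);
        for (;;)
        {
            uint64_t candidate = nextCandidate(offset);
            if (candidate == offset) break;   // nothing new suggested
            int fit = goodness(candidate);
            if (fit <= best) break;           // no longer improving
            offset = candidate;               // this is the missing loop
            best = fit;
        }
        return offset;
    }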
> But I think it would be even better to calculate weak checksums for all
> block sizes at the same time, during a single pass, and then make a second
> pass to identify which of those are real second-level checksum matches.
> The main disadvantage would be potentially storing a lot more matches in
> memory until the recipe is computed.
In fact, this is the core of my issue. In my problem case the file is a
1.5GB mail database stored on an encrypted filesystem. The CPU and I/O load
of the repeated file reads is huge: a minimum of 15 minutes to diff, with
both cores running at 100% and my laptop fans screaming. When the file
content gets purged from the file cache by other machine usage it gets even
worse. Capping the diff time isn't a good option, since my goal is not to
upload 1.5GB every time.
I agree a one-read-pass diff is the right solution, and the comment at line
488 seems to validate that assumption.
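For what it's worth, the rough shape of the single-read scan I have in mind
is below. It reuses the WeakChecksum sketch from above; weakHit stands in for
the probe against the server's block index, and none of these names come
from BackupStoreFileDiff.cpp. Weak matches for every block size are collected
during one pass over the file and would be verified with strong checksums
afterwards, which is exactly where the extra memory Chris mentioned comes
from.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <vector>

    struct WeakMatch
    {
        uint64_t offset;     // where in the local file the window matched
        size_t   blockSize;  // which block size produced the weak match
    };

    std::vector<WeakMatch> ScanOnce(const uint8_t *file, size_t fileLen,
        std::vector<size_t> blockSizes,
        const std::function<bool(uint32_t, size_t)> &weakHit)
    {
        std::vector<WeakMatch> candidates;

        // Drop block sizes that can't fit, then set up one rolling checksum
        // per remaining size, all positioned at offset 0.
        blockSizes.erase(
            std::remove_if(blockSizes.begin(), blockSizes.end(),
                [fileLen](size_t bs) { return bs == 0 || bs > fileLen; }),
            blockSizes.end());

        std::vector<WeakChecksum> sums;
        for (size_t bs : blockSizes)
        {
            sums.emplace_back(file, bs);
        }

        // Single pass: at each offset, probe every block size, then roll
        // each window forward one byte. The file is read exactly once.
        for (size_t pos = 0; pos < fileLen; ++pos)
        {
            for (size_t i = 0; i < blockSizes.size(); ++i)
            {
                const size_t bs = blockSizes[i];
                if (pos + bs > fileLen) continue;   // window ran off the end
                if (weakHit(sums[i].Value(), bs))   // candidate weak match
                {
                    candidates.push_back(WeakMatch{pos, bs});
                }
                if (pos + bs < fileLen)             // a byte left to enter
                {
                    sums[i].RollForward(file[pos], file[pos + bs]);
                }
            }
        }
        return candidates;
    }

Obviously my real draft is messier than that, but that's the I/O pattern I'm
aiming for.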
What I left out of my previous message is that I was looking at fixing the
I/O issue as well. I didn't mention it then because I didn't want to
overpromise, but my first draft passes the unit tests and runs the same diff
2.5 times faster. Unfortunately, right now it also has readability issues and
produces different (seemingly correct) recipes from the old code, which makes
it hard to validate.
I'll continue polishing, and assuming my understanding of block diff
optimization (including block ordering) is correct, I may have something
worth talking about eventually. However, it's the kind of thing that would
require careful review by someone who understands BackupStoreFileDiff.cpp
better than I do. I'm hoping a volunteer will emerge :)
Alex
--
Alex Harper aharper@foobox.net
"Generally speaking, things have gone about as far as they can reasonably
go when things have got about as bad as they can reasonably get."
- Tom Stoppard