[Box Backup] File diff inverts block probability? BackupStoreFileDiff.cpp:515

Alex Harper boxbackup@fluffy.co.uk
Tue, 26 Feb 2008 17:51:50 -0800


Hi,

I've been trying to understand Box's slow diff of some large files I'm
backing up (compared with rsync).

BackupStoreFileDiff.cpp:515 runs a loop of the various server side block
sizes in reverse order, with the comment that its running from smallest to
largest. I suspect this is no longer necessary, as it contradicts the
comments at line 858.

FindMostUsedSizes() seems to produce a block size array in
block-size-frequency order, not size order. Thus running the loop in reverse
would run from least common block size to most common block size. For a
mostly unchanged file stored mostly as large blocks with just a few small
blocks, this seems to eliminate most of the goodnessOfFit[] short-circuiting
(there is a separate optimization miss with goodnessOfFit, but I'll leave
that for another day).

Reversing the loop seems logical, passes unittests and seems to work for me
(so far, not a lot of testing). Speedup is 15-20% for my diffs.

Is this a reasonable read of the code? Have I misunderstood something
fundamental about the relationship between local block scans and server-side
storage?

Alex


--
Alex Harper                                         aharper@foobox.net

"Generally speaking, things have gone about as far as they can reasonably
go when things have got about as bad as they can reasonably get."
                                                          - Tom Stoppard