[Box Backup] File diff inverts block probability? BackupStoreFileDiff.cpp:515

Chris Wilson boxbackup@fluffy.co.uk
Wed, 27 Feb 2008 22:58:55 +0000 (GMT)


Hi Alex,

On Tue, 26 Feb 2008, Alex Harper wrote:

> I've been trying to understand Box's slow diff of some large files I'm
> backing up (compared with rsync).
>
> BackupStoreFileDiff.cpp:515 runs a loop over the various server-side block
> sizes in reverse order, with a comment saying it's running from smallest to
> largest. I suspect this reversal is no longer necessary, as it contradicts
> the comments at line 858.
>
> FindMostUsedSizes() seems to produce a block size array ordered by 
> block-size frequency, not by size. Thus running the loop in reverse 
> means scanning from the least common block size to the most common 
> one. For a mostly unchanged file stored mainly as large blocks with 
> just a few small ones, this seems to eliminate most of the 
> goodnessOfFit[] short-circuiting (there is a separate missed 
> optimisation around goodnessOfFit, but I'll leave that for another day).
>
> Reversing the loop seems logical, passes the unit tests, and works for 
> me so far (though I haven't done a lot of testing). The speedup is 
> 15-20% for my diffs.
>
> Is this a reasonable read of the code? Have I misunderstood something 
> fundamental about the relationship between local block scans and 
> server-side storage?

Many thanks for looking into this and posting such a detailed and helpful 
report. I'm afraid I didn't write that code, don't know it as well as I 
should, and don't completely understand how it works. In particular, I 
have no idea whether scanning the least or the most frequently used block 
sizes first should actually minimise the amount of data to be 
transferred.

I can see reasons for scanning the largest block sizes first, but the 
original code didn't appear to do that either; quite the opposite. Given 
the limit on the number of matching blocks that we'll use in the recipe, 
I'd say scanning the largest block sizes first makes the most sense.
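
Just to make sure we're talking about the same thing, here's a rough
sketch of the three orderings as I understand them. This is purely
illustrative, standalone code, not the real BackupStoreFileDiff
internals; the sizes and the ScanForSize() stand-in are made up:

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative stand-in for the real per-block-size scan of the file.
static void ScanForSize(int32_t BlockSize)
{
    std::printf("scanning local file with block size %d\n", (int)BlockSize);
}

int main()
{
    // Pretend FindMostUsedSizes() returned these, most frequently
    // used size first (as you describe).
    std::vector<int32_t> sizes;
    sizes.push_back(4096);
    sizes.push_back(512);
    sizes.push_back(65536);

    // Current loop, as I read your report: reverse order, so the
    // least common size is scanned first.
    for (int i = (int)sizes.size() - 1; i >= 0; --i)
    {
        ScanForSize(sizes[i]);
    }

    // Your proposal: forward order, most common size first.
    for (size_t i = 0; i < sizes.size(); ++i)
    {
        ScanForSize(sizes[i]);
    }

    // The alternative I mentioned: largest size first, regardless of
    // how often it occurs on the server.
    std::vector<int32_t> bySize(sizes);
    std::sort(bySize.rbegin(), bySize.rend());
    for (size_t i = 0; i < bySize.size(); ++i)
    {
        ScanForSize(bySize[i]);
    }

    return 0;
}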

At this time I'm afraid I can neither refute nor support your assertion 
that changing the loop order results in faster diffs or transfers. I'm 
not against doing it, although I'd like to see the new code tested 
heavily before it goes into a release.

We currently re-read the file once for each block size that we intend to 
try. If we started with the most commonly used block sizes instead of 
the least, as you propose, then we could potentially find all our 
matches at those sizes and reduce the number of times that we have to 
re-read the file (i.e. minimise local disk I/O).
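
Schematically, and again with made-up names rather than the real code,
what I have in mind is an early exit once nothing is left to match, so
the rarer sizes never trigger another read of the file:

#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative stand-in for one full read of the local file with one
// rolling-checksum window size, returning how much is still unmatched.
static int64_t FindBlocksOfSize(int32_t BlockSize, int64_t BytesUnmatched)
{
    std::printf("re-reading file with block size %d\n", (int)BlockSize);
    return BytesUnmatched / 2; // pretend half the remainder matched
}

int main()
{
    const int64_t fileSize = 1 << 20;

    // Most commonly used sizes first, per your reading of
    // FindMostUsedSizes().
    std::vector<int32_t> sizes;
    sizes.push_back(65536);
    sizes.push_back(4096);
    sizes.push_back(512);

    int64_t unmatched = fileSize;
    for (size_t i = 0; i < sizes.size(); ++i)
    {
        unmatched = FindBlocksOfSize(sizes[i], unmatched);
        if (unmatched == 0)
        {
            // Everything matched already: no need to re-read the file
            // for the remaining, rarer block sizes.
            break;
        }
    }

    std::printf("%lld bytes left unmatched\n", (long long)unmatched);
    return 0;
}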

If I understand you correctly, the change you propose has a negligible or 
uncertain impact on network bandwidth use, and does reduce disk I/O, so I 
can see that as an argument in favour of your suggestion.

But I think it would be even better to calculate weak checksums for all 
block sizes at the same time, during a single pass, and then make a second 
pass to identify which of those are real second-level checksum matches. 
The main disadvantage would be potentially storing a lot more matches in 
memory until the recipe is computed.
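
As a very rough sketch of that idea, using a plain additive sum as a
stand-in for the real rolling checksum, and a made-up stand-in for the
server-side block index:

#include <cstdint>
#include <cstdio>
#include <unordered_set>
#include <vector>

// One candidate window found by a weak-checksum match; a second pass
// would verify the strong checksum before it goes into a recipe.
struct Candidate
{
    int32_t blockSize;
    int64_t offset;        // where the candidate window starts in the file
    uint32_t weakChecksum;
};

int main()
{
    std::vector<int32_t> sizes;
    sizes.push_back(512);
    sizes.push_back(4096);
    sizes.push_back(65536);

    std::vector<uint8_t> file(1 << 20, 0x42);   // stand-in for the local file

    // Stand-in for the weak checksums of the blocks stored on the server,
    // one set per block size (empty here, so the demo records nothing; in
    // reality it would be filled from the block index sent by the server).
    std::vector<std::unordered_set<uint32_t> > serverWeak(sizes.size());

    // One rolling (here: plain additive) checksum per block size, all
    // advanced together during a single read of the file.
    std::vector<uint32_t> sums(sizes.size(), 0);
    std::vector<Candidate> candidates;

    for (int64_t pos = 0; pos < (int64_t)file.size(); ++pos)
    {
        for (size_t s = 0; s < sizes.size(); ++s)
        {
            const int32_t bs = sizes[s];
            sums[s] += file[pos];               // byte entering the window
            if (pos >= bs)
            {
                sums[s] -= file[pos - bs];      // byte leaving the window
            }
            if (pos + 1 >= bs && serverWeak[s].count(sums[s]) != 0)
            {
                // Weak match: remember it for the second pass.
                Candidate c = {bs, pos + 1 - bs, sums[s]};
                candidates.push_back(c);
            }
        }
    }

    std::printf("%zu weak-checksum candidates to verify\n", candidates.size());
    return 0;
}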

I also think that this code should be rewritten to be optimised for 
readability :-) Then we could all argue about ways to improve it on the 
basis of a real understanding of what it actually does now :-)

Cheers, Chris.
-- 
_____ __     _
\  __/ / ,__(_)_  | Chris Wilson <0000 at qwirx.com> - Cambs UK |
/ (_/ ,\/ _/ /_ \ | Security/C/C++/Java/Ruby/Perl/SQL Developer |
\__/_/_/_//_/___/ | We are GNU : free your mind & your software |