[Box Backup-dev] Re: [Box Backup] Re: Block Sizes and Diffing

Ben Summers boxbackup-dev@fluffy.co.uk
Wed, 26 Sep 2007 12:24:53 +0100


Chris Wilson wrote:

>>> Is this related to chromi's diffing optimisation? Is he around to  
>>> help
>>> us?
>>
>> Chromi's work just optimised the original algorithm. There isn't very
>> much you can do to tweak this, but I'm not sure anything which has  
>> been
>> observed is a problem.
>
> I don't understand then why it chooses a different block size each  
> time,
> and the block sizes are highly variable. Surely having to re-reead the
> same file 128 times to calculate the checksums of each block at the  
> 128
> most common block sizes isn't reasonable? Is there anything we can  
> do to
> reduce the number of block sizes in a file, or at least the number of
> times that the file is rescanned?
>
> I think I saw in the code a hard-coded limit on the number of block  
> sizes
> that it will check, based on the most frequently-occuring block  
> sizes in
> the file. But it allows up to 128 different block sizes (it find  
> the 128
> most common) which is still a lot of re-reads (not as bad as 1216, but
> still bad IMHO).
>
> I also don't understand why the block sizes aren't all powers of two,
> as the block-size selection algorithm seems to double the size every
> time, but that's probably because I'm misreading the code. If it  
> was true,
> then it should limit the number of block sizes in the file anyway.  
> In this
> case, there are probably no blocks larger than 4k, so the 12 powers
> of two up to 4096 should be the only 12 block sizes found, which  
> would be
> much more reasonable.

Consider a 128k text file which uses 4k blocks. Insert a character  
somewhere in the middle of a block, then back it up again. You've now  
got a gap which isn't covered by any of the existing blocks, and it's  
4097 bytes long. How can a non-standard block size be avoided without  
bandwidth inefficiency or knowledge of the key on the server side?  
What do you do when it gets lots of little inserts over hundreds of  
backup runs?

Benefits have costs. Tradeoffs have to be made.

Ben