[Box Backup] BlockSize and large file diffs

Ben Summers boxbackup@fluffy.co.uk
Tue, 15 Nov 2005 17:35:37 +0000


On 15 Nov 2005, at 17:21, Gary wrote:

> Ben,
>
>>>         BlockSize = 10240
>
>> Your script would find that this makes absolutely no difference.
>> These settings affect storage on the server only. Diffing is done on
>> the client.
>
> Hmmm, I was under the impression that the server block size was
> affecting the size of a client-side "block" that was used during
> cutting up large files into pieces for client/server comparison. What
> would follow, naturally, would be the number of block checksums
> downloaded from the server, and the number of comparisons made for
> each large file.
>
> Question: is there any way to control the block size that is used for
> the diffing algorithm? Theoretically, it should be possible to set the
> diffing block size of a 500MB file to, say, 1MB, and have only 500
> blocks to compare (500 * sizeof(checksum) to download), instead of,
> say, 51200 10KB blocks to compare (51200 * sizeof(checksum) to
> download).
>
> Do I understand the diffing process correctly, here?

I think so.

The block size is chosen automatically for you to balance space  
efficiency and the time it takes to diff. As you suggest, it does use  
larger blocks for larger files.
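
To put numbers on the trade-off, here is a rough sketch. It is not the
code in the client; every name and constant in it is made up for
illustration, and the "aim for a fixed number of blocks, within limits"
rule is just one assumed way of making blocks grow with file size.

```cpp
#include <cstdint>

// Hypothetical names and constants, for illustration only.
// None of them are taken from the Box Backup source.
constexpr std::int64_t kKiB = 1024;
constexpr std::int64_t kMiB = 1024 * kKiB;
constexpr std::int64_t kMinBlockSize = 4 * kKiB;  // assumed lower limit
constexpr std::int64_t kMaxBlockSize = 1 * kMiB;  // assumed upper limit
constexpr std::int64_t kTargetBlocks = 4096;      // assumed blocks per file

// Number of blocks (and so block checksums to fetch and compare)
// needed to cover a file of fileSize bytes, rounding up.
constexpr std::int64_t BlocksNeeded(std::int64_t fileSize,
                                    std::int64_t blockSize)
{
    return (fileSize + blockSize - 1) / blockSize;
}

// Keep a candidate block size within the assumed limits.
constexpr std::int64_t ClampBlockSize(std::int64_t size)
{
    return size < kMinBlockSize ? kMinBlockSize
         : size > kMaxBlockSize ? kMaxBlockSize
         : size;
}

// Pick a block size that gives roughly kTargetBlocks blocks, so larger
// files get larger blocks until the upper limit is reached.
constexpr std::int64_t ChooseBlockSize(std::int64_t fileSize)
{
    return ClampBlockSize(BlocksNeeded(fileSize, kTargetBlocks));
}
```

Taking MB and KB as binary units, BlocksNeeded(500 * kMiB, 10 * kKiB) is
51200 and BlocksNeeded(500 * kMiB, 1 * kMiB) is 500, which matches the
figures quoted above. With these made-up constants,
ChooseBlockSize(500 * kMiB) comes out at 128000 bytes, or 4096 blocks.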

I don't think it's necessary to make this tunable. If the current
scheme isn't very good, we should improve the code until it works
well out of the box rather than ship something which needs tweaking.

If you want to play with the algorithms, look at

    lib/backupclient/BackupStoreConstants.h
    lib/backupclient/BackupStoreFileEncodeStream.cpp

and at the function BackupStoreFileEncodeStream::CalculateBlockSizes()
in the latter file, which chooses the basic block size.

Patches for your improved selection algorithm, backed up with real  
world timing tests, are welcome! :-)

Ben