[Box Backup-dev] Re: [Box Backup] Re: Block Sizes and Diffing

Tue, 25 Sep 2007 21:36:34 +0100 (BST)

Hi Ben,

On Tue, 25 Sep 2007, Ben Summers wrote:

>> > No, it picks an "appropriate" block size for each chunk that it 
>> > detects has changed. Personally I don't think this is particularly 
>> > smart, I think we should keep the same block size for the whole file.
>> 
>> What do you think about the idea of reducing the number of possible 
>> block sizes in a file? Please could you explain to this bear of little 
>> brain how the block-size selection algorithm is supposed to work for 
>> patches, and why we allow so many different block sizes in the same 
>> file?
>
> A preferred block size is chosen for a file. The algorithm identifies a 
> 'gap' which cannot be covered by existing blocks from the 64 most common 
> block sizes. Blocks of the preferred block size are used to cover the 
> gap, with the space at the end with either a smaller block or a slightly 
> larger block, as appropriate.
>
> Sizes of existing blocks cannot be adjusted nor can the contents of 
> blocks because of the need to conserve bandwidth use and that the 
> encryption means that the contents are opaque to the server.
>
>> Is this related to chromi's diffing optimisation? Is he around to help 
>> us?
>
> Chromi's work just optimised the original algorithm. There isn't very 
> much you can do to tweak this, but I'm not sure anything which has been 
> observed is a problem.

I don't understand then why it chooses a different block size each time, 
and the block sizes are highly variable. Surely having to re-reead the 
same file 128 times to calculate the checksums of each block at the 128 
most common block sizes isn't reasonable? Is there anything we can do to 
reduce the number of block sizes in a file, or at least the number of 
times that the file is rescanned?

I think I saw in the code a hard-coded limit on the number of block sizes 
that it will check, based on the most frequently-occuring block sizes in 
the file. But it allows up to 128 different block sizes (it find the 128 
most common) which is still a lot of re-reads (not as bad as 1216, but 
still bad IMHO).

I also don't understand why the block sizes aren't all powers of two, 
as the block-size selection algorithm seems to double the size every 
time, but that's probably because I'm misreading the code. If it was true, 
then it should limit the number of block sizes in the file anyway. In this 
case, there are probably no blocks larger than 4k, so the 12 powers 
of two up to 4096 should be the only 12 block sizes found, which would be 
much more reasonable.

Any ideas?

Cheers, Chris.
-- 
_____ __     _
\  __/ / ,__(_)_  | Chris Wilson <0000 at qwirx.com> - Cambs UK |
/ (_/ ,\/ _/ /_ \ | Security/C/C++/Java/Perl/SQL/HTML Developer |
\ _/_/_/_//_/___/ | We are GNU-free your mind-and your software |