[Box Backup-dev] Re: [Box Backup] Re: Block Sizes and Diffing

Ben Summers boxbackup-dev@fluffy.co.uk
Tue, 25 Sep 2007 12:14:22 +0100


>
Chris Wilson wrote:

> On Mon, 10 Sep 2007, Chris Wilson wrote:
>
>> Hi Johann,
>>
>> On Mon, 10 Sep 2007, Johann Glaser wrote:
>>
>>>  The output consists of 1216 lines and starts with:
>>>     642     1069 this  s=      49
>>>     407      107 this  s=     177
>>>     336     1065 this  s=    1633
>>>     296     1363 this  s=    1649
>>>     277     1159 this  s=    1617
>>>     265     1424 this  s=    1681
>>>     254     2706 this  s=    1601
>>>     248     1011 this  s=    1665
>>>     248     1005 this  s=      33
>>>     246     1015 this  s=    1745
>>>     219     1027 this  s=    1729
>>>     205     1226 this  s=    1697
>>>     200     1006 this  s=    1713
>>>     194     1012 this  s=    1585
>>>     156     1303 this  s=    1569
>>>     155     1014 this  s=    1761
>>>     141     1016 this  s=    2017
>>>     114     1020 this  s=    2033
>>>     111     1246 this  s=    1553
>>>     103     1025 this  s=    1777
>>>  (and all following are <100 for the first column)
>>
>> Ouch! 1216 different block sizes in the same file!
>>
>> Ben, I think we need to fix the diffing algorithm. This doesn't seem
>> reasonable to me.
>>
>>>  Unfortunately I don't understand BoxBackup's diffing and block-size
>>>  algorithms, so I don't know what to conclude from my above  
>>> listing. :-)
>>>
>>>  Do I understand correctly, that BoxBackup tries to find the  
>>> smallest
>>>  possible block size to transmit (and store) changes?
>>
>> No, it picks an "appropriate" block size for each chunk that it  
>> detects has
>> changed. Personally I don't think this is particularly smart, I  
>> think we
>> should keep the same block size for the whole file.
>
> What do you think about the idea of reducing the number of possible  
> block
> sizes in a file? Please could you explain to this bear of little brain
> how the block-size selection algorithm is supposed to work for  
> patches,
> and why we allow so many different block sizes in the same file?

A preferred block size is chosen for a file. The algorithm identifies  
a 'gap' which cannot be covered by existing blocks from the 64 most  
common block sizes. Blocks of the preferred block size are used to  
cover the gap, with the space at the end with either a smaller block  
or a slightly larger block, as appropriate.

Sizes of existing blocks cannot be adjusted nor can the contents of  
blocks because of the need to conserve bandwidth use and that the  
encryption means that the contents are opaque to the server.

>
> Is this related to chromi's diffing optimisation? Is he around to  
> help us?

Chromi's work just optimised the original algorithm. There isn't very  
much you can do to tweak this, but I'm not sure anything which has  
been observed is a problem.

Ben