[Box Backup-dev] Re: [Box Backup] Re: Block Sizes and Diffing
Ben Summers
boxbackup-dev@fluffy.co.uk
Tue, 25 Sep 2007 12:14:22 +0100
>
Chris Wilson wrote:
> On Mon, 10 Sep 2007, Chris Wilson wrote:
>
>> Hi Johann,
>>
>> On Mon, 10 Sep 2007, Johann Glaser wrote:
>>
>>> The output consists of 1216 lines and starts with:
>>> 642 1069 this s= 49
>>> 407 107 this s= 177
>>> 336 1065 this s= 1633
>>> 296 1363 this s= 1649
>>> 277 1159 this s= 1617
>>> 265 1424 this s= 1681
>>> 254 2706 this s= 1601
>>> 248 1011 this s= 1665
>>> 248 1005 this s= 33
>>> 246 1015 this s= 1745
>>> 219 1027 this s= 1729
>>> 205 1226 this s= 1697
>>> 200 1006 this s= 1713
>>> 194 1012 this s= 1585
>>> 156 1303 this s= 1569
>>> 155 1014 this s= 1761
>>> 141 1016 this s= 2017
>>> 114 1020 this s= 2033
>>> 111 1246 this s= 1553
>>> 103 1025 this s= 1777
>>> (and all following are <100 for the first column)
>>
>> Ouch! 1216 different block sizes in the same file!
>>
>> Ben, I think we need to fix the diffing algorithm. This doesn't seem
>> reasonable to me.
>>
>>> Unfortunately I don't understand BoxBackup's diffing and block-size
>>> algorithms, so I don't know what to conclude from my above
>>> listing. :-)
>>>
>>> Do I understand correctly, that BoxBackup tries to find the
>>> smallest
>>> possible block size to transmit (and store) changes?
>>
>> No, it picks an "appropriate" block size for each chunk that it
>> detects has
>> changed. Personally I don't think this is particularly smart, I
>> think we
>> should keep the same block size for the whole file.
>
> What do you think about the idea of reducing the number of possible
> block
> sizes in a file? Please could you explain to this bear of little brain
> how the block-size selection algorithm is supposed to work for
> patches,
> and why we allow so many different block sizes in the same file?
A preferred block size is chosen for a file. The algorithm identifies
a 'gap' which cannot be covered by existing blocks from the 64 most
common block sizes. Blocks of the preferred block size are used to
cover the gap, with the space at the end with either a smaller block
or a slightly larger block, as appropriate.
Sizes of existing blocks cannot be adjusted nor can the contents of
blocks because of the need to conserve bandwidth use and that the
encryption means that the contents are opaque to the server.
>
> Is this related to chromi's diffing optimisation? Is he around to
> help us?
Chromi's work just optimised the original algorithm. There isn't very
much you can do to tweak this, but I'm not sure anything which has
been observed is a problem.
Ben