[Box Backup] Re: Block Sizes and Diffing (was: Re: [Box Backup] error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry)

Chris Wilson boxbackup@fluffy.co.uk
Mon, 10 Sep 2007 21:50:03 +0100 (BST)


Hi Johann,

On Mon, 10 Sep 2007, Johann Glaser wrote:

> The output consists of 1216 lines and starts with:
>    642     1069 this  s=      49
>    407      107 this  s=     177
>    336     1065 this  s=    1633
>    296     1363 this  s=    1649
>    277     1159 this  s=    1617
>    265     1424 this  s=    1681
>    254     2706 this  s=    1601
>    248     1011 this  s=    1665
>    248     1005 this  s=      33
>    246     1015 this  s=    1745
>    219     1027 this  s=    1729
>    205     1226 this  s=    1697
>    200     1006 this  s=    1713
>    194     1012 this  s=    1585
>    156     1303 this  s=    1569
>    155     1014 this  s=    1761
>    141     1016 this  s=    2017
>    114     1020 this  s=    2033
>    111     1246 this  s=    1553
>    103     1025 this  s=    1777
> (and all following are <100 for the first column)

Ouch! 1216 different block sizes in the same file!

Ben, I think we need to fix the diffing algorithm. This doesn't seem 
reasonable to me.

> Unfortunately I don't understand BoxBackup's diffing and block-size 
> algorithms, so I don't know what to conclude from my above listing. :-)
>
> Do I understand correctly that BoxBackup tries to find the smallest
> possible block size to transmit (and store) changes?

No, it picks an "appropriate" block size for each chunk that it detects 
has changed. Personally I don't think this is particularly smart; I think 
we should keep the same block size for the whole file.

> OTOH I have a suggestion for the block algorithm. The block size can be 
> defined to a fixed size, e.g. 4k and you can accept that up to 4095 
> bytes might be duplicate. In my opinion wasting a few kB is tolerable.
>
> There is still a problem: insertions or deletions in the file can't be 
> identified this way. Imagine a single byte insertion at the very 
> beginning of the file. Then every 4k-aligned block will have changed and 
> the whole file needs to be updated. This problem has already been 
> addressed by rsync and is described at 
> http://rsync.samba.org/tech_report/ ("The rsync algorithm" and "Rolling 
> Checksum").

I believe that we already implement this, albeit modified to work with 
encrypted data. See

   http://bbdev.fluffy.co.uk/svn/box/trunk/docs/backup/encrypt_rsync.txt

for details.
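
As a rough illustration of the rolling-checksum idea from the rsync technical report, here is a minimal Python sketch. It is not Box Backup's code, which differs because it has to work with encrypted data. The function names, the block size passed in as `n`, and the use of SHA-256 as the strong hash are assumptions made for this example only. It shows why a one-byte insertion at the start of a file need not invalidate every fixed-size block: the weak checksum is slid along the new file one byte at a time, so old blocks can be found at unaligned offsets.

```python
import hashlib

M = 1 << 16  # modulus of the weak checksum in the rsync technical report


def weak_checksum(block):
    """Compute the two weak checksum halves (a, b) for a block from scratch."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window one byte to the right in constant time."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b


def index_blocks(old, n):
    """Map weak checksum -> [(strong hash, block number)] for aligned old blocks."""
    table = {}
    for start in range(0, len(old) - n + 1, n):
        block = old[start:start + n]
        entry = (hashlib.sha256(block).digest(), start // n)
        table.setdefault(weak_checksum(block), []).append(entry)
    return table


def find_matches(old, new, n):
    """Return (offset in new, block number in old) for old blocks found in new."""
    table = index_blocks(old, n)
    matches = []
    if len(new) < n:
        return matches
    pos = 0
    a, b = weak_checksum(new[:n])
    while True:
        hit = None
        candidates = table.get((a, b))
        if candidates:
            # Weak checksums can collide, so confirm with the strong hash.
            strong = hashlib.sha256(new[pos:pos + n]).digest()
            for digest, block_no in candidates:
                if digest == strong:
                    hit = block_no
                    break
        if hit is not None:
            matches.append((pos, hit))
            pos += n  # skip past the matched block
            if pos + n > len(new):
                break
            a, b = weak_checksum(new[pos:pos + n])
        else:
            if pos + n >= len(new):
                break
            a, b = roll(a, b, new[pos], new[pos + n], n)
            pos += 1
    return matches
```

Calling `find_matches(old, b"X" + old, 4096)` on a file whose blocks all differ should report every old block again, each shifted by one byte, whereas a comparison of 4k-aligned blocks alone would find none of them.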

> BTW: The (unofficial) BoxBackup Debian package from 
> http://debian.myreseau.org/ doesn't contain the bbackupobjdump tool, so 
> I checked out the trunk directory from the SVN. It doesn't build 
> bbackupobjdump automatically, so I had to trick a little bit. How can I 
> do this "beautifully"?

Sorry, the easiest way is to run configure, and then:

   cd bin/bbackupobjdump; make; cd ../..; debug/bin/bbackupobjdump ...

Cheers, Chris.
-- 
_____ __     _
\  __/ / ,__(_)_  | Chris Wilson <0000 at qwirx.com> - Cambs UK |
/ (_/ ,\/ _/ /_ \ | Security/C/C++/Java/Perl/SQL/HTML Developer |
\ _/_/_/_//_/___/ | We are GNU-free your mind-and your software |