[Box Backup] BadBackupStoreFile

Chris Wilson boxbackup@fluffy.co.uk
Tue, 4 Dec 2007 20:40:36 +0000 (GMT)


Hi Johann,

On Tue, 4 Dec 2007, Johann Glaser wrote:

> > Each file will be diffed for up to 20,000 seconds, nearly 6 hours. 
> > If you have a number of large files that really can take 6 hours to 
> > diff each, then your backup will run... really... slowly.
> 
> Yes, sure, but what happens when the diff was canceled due to time out? 
> Either BoxBackup assumes the file didn't change or it assumes it has 
> changed. Both cases are bad. In the first case changes at the end of the 
> file will not be backed up. In the second case the huge file will be 
> backed up every day again, even if there were no changes. And the file 
> will be backed up as a whole, even if just one additional byte was 
> appended.

Not exactly. It appears that if the diff times out, Box Backup will use 
whatever matching blocks it has found so far, and resend the rest. I guess 
which you prefer (uploading more data or spending more time diffing) 
depends on your tradeoff between backup speed and bandwidth, but 20,000 
seconds feels too long to me (I'd rather not have my client upload 
nothing for six hours while it diffs).
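
To make the timeout behaviour concrete, here is a minimal Python sketch. It 
is not Box Backup's code: the function name, the single fixed block size, 
the SHA-256 checksum and the injectable clock are all assumptions made for 
illustration. It only shows the idea of keeping the matches found before 
the deadline and sending everything after them as new data.

```python
import hashlib
import time


def diff_with_deadline(old_blocks, new_data, block_size, max_seconds,
                       clock=time.monotonic):
    """Return a recipe of ('ref', block_index) and ('data', bytes) entries.

    Hypothetical helper: old_blocks stands in for the blocks already on
    the server, all of length block_size.
    """
    # Map each known block's checksum to its index on the server.
    known = {hashlib.sha256(b).digest(): i for i, b in enumerate(old_blocks)}
    deadline = clock() + max_seconds
    recipe = []
    literal_start = 0  # start of bytes not yet covered by a match
    pos = 0
    while pos + block_size <= len(new_data):
        if clock() >= deadline:
            break  # timed out: keep the matches found so far
        digest = hashlib.sha256(new_data[pos:pos + block_size]).digest()
        match = known.get(digest)
        if match is None:
            pos += 1  # no match here, slide the window by one byte
            continue
        if literal_start < pos:
            recipe.append(("data", new_data[literal_start:pos]))
        recipe.append(("ref", match))
        pos += block_size
        literal_start = pos
    # Whatever was not matched (including everything after a timeout)
    # is sent as new data.
    if literal_start < len(new_data):
        recipe.append(("data", new_data[literal_start:]))
    return recipe
```

With a generous limit the whole file is scanned; with a short one the tail 
of the file is simply uploaded again, which costs bandwidth but never loses 
changes.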

> > Yes, it does kind of make sense. The file on the server can easily have a 
> > mix of block sizes, and we need to calculate whether we have matches for 
> > any of those blocks. Currently we can only search for matches using one 
> > block size at a time, so we have to re-read the file once for each block 
> > size (up to 128 times, I think, in the worst case).
> > 
> > It would be possible to optimise this to calculate checksums and find 
> > matches for all block sizes in a single pass, but nobody has done that 
> > work yet.
> 
> So, is it just an implementation issue, and not a conceptual one?

Yes, I believe so.
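
As a rough illustration of what a single-pass search could look like, the 
Python sketch below keeps one rolling checksum per distinct block size and 
updates all of them as each byte is consumed. Nothing here is taken from 
Box Backup: the function names, the 16-bit rolling sum and the SHA-256 
confirmation step are assumptions, and the data is held in memory for 
simplicity where a real client would stream it through a buffer.

```python
import hashlib

MOD = 1 << 16


def weak(block):
    """Weak checksum (a, b) of a block, cheap to roll forward by one byte."""
    n = len(block)
    a = sum(block) % MOD
    b = sum((n - i) * x for i, x in enumerate(block)) % MOD
    return a, b


def find_matches_single_pass(server_blocks, data):
    """Return sorted (offset, block_index) matches for mixed-size blocks."""
    # One lookup table per distinct block size:
    # size -> {weak checksum: [(strong checksum, block index), ...]}
    tables = {}
    for idx, blk in enumerate(server_blocks):
        entry = (hashlib.sha256(blk).digest(), idx)
        tables.setdefault(len(blk), {}).setdefault(weak(blk), []).append(entry)

    state = {}  # size -> weak checksum of the window ending at `end`
    matches = []
    for end in range(1, len(data) + 1):
        incoming = data[end - 1]
        for size, table in tables.items():
            if end < size:
                continue  # not enough bytes yet for this block size
            if end == size:
                state[size] = weak(data[:size])
            else:
                a, b = state[size]
                outgoing = data[end - size - 1]
                a = (a - outgoing + incoming) % MOD
                b = (b - size * outgoing + a) % MOD
                state[size] = (a, b)
            start = end - size
            candidates = table.get(state[size], ())
            if candidates:
                # Confirm weak hits with the strong checksum.
                strong = hashlib.sha256(data[start:end]).digest()
                matches.extend((start, i) for s, i in candidates if s == strong)
    return sorted(matches)
```

The cost per byte grows with the number of distinct block sizes, but the 
file itself is only read once instead of once per block size.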

> > Noted on [http://www.boxbackup.org/trac/wiki/ConfiguringAServer].
> 
> Great, thanks. I've added a little more text to guide users to external
> HDDs (USB, FireWire, eSATA).

Thanks.

> > > Regarding modifying BoxBackup to tolerate slow storage: Changing its
> > > storage format should be avoided. :-)
> > 
> > But may be necessary, unfortunately. Large files tend to perform better 
> > than small ones, especially for streaming and in reducing disk overhead, 
> > and if we want to support Amazon S3 then it seems to be the way to go.
> 
> Hmm, using the backend data store of BoxBackup via a network (either S3
> or NFS) is IMHO a bad idea. It would be better to have the BoxBackup
> Server directly at the remote site so that only the BoxBackup protocol
> (using diffs) is transmitted via the network.

But that's not necessarily possible in all cases (e.g. not unless you have 
an Amazon EC2 server). And if you do, then you still pay something for each 
request, and the only way to reduce the number of requests seems to be to 
use larger files (multiple client files' blocks in a single file on the 
server). Unless you can think of a better way?
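
To show what I mean by larger files, here is a small Python sketch that 
packs blocks from several client files into fixed-size containers and keeps 
an index of where each block lives. This is not an existing or planned 
Box Backup format: the names, the block ID strings and the container size 
are all made up for the example.

```python
def pack_blocks(blocks, container_size):
    """Pack {block_id: payload} into containers of about container_size bytes.

    Returns (containers, index), where index maps each block_id to
    (container number, offset, length). A payload larger than
    container_size gets a container of its own.
    """
    containers = [bytearray()]
    index = {}
    for block_id, payload in blocks.items():
        current = containers[-1]
        # Start a new container when this payload would overflow the current one.
        if current and len(current) + len(payload) > container_size:
            containers.append(bytearray())
            current = containers[-1]
        index[block_id] = (len(containers) - 1, len(current), len(payload))
        current += payload
    return [bytes(c) for c in containers], index


def read_block(containers, index, block_id):
    """Fetch one block back out of its container using the index."""
    number, offset, length = index[block_id]
    return containers[number][offset:offset + length]
```

Storing four blocks this way with a 1024-byte container size needs two 
objects on the server rather than four, so fewer billable requests when 
writing, at the price of keeping the index somewhere and reading a byte 
range when fetching a single block.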

Cheers, Chris.
-- 
_____ __     _
\  __/ / ,__(_)_  | Chris Wilson <0000 at qwirx.com> - Cambs UK |
/ (_/ ,\/ _/ /_ \ | Security/C/C++/Java/Perl/SQL/HTML Developer |
\ _/_/_/_//_/___/ | We are GNU-free your mind-and your software |