[Box Backup] BadBackupStoreFile

Johann Glaser boxbackup@fluffy.co.uk
Tue, 04 Dec 2007 11:13:51 +0100


Hi!

On Monday, 03.12.2007 at 22:15 +0000, Chris Wilson wrote:
> Hi Johann,
> 
> Apologies for the late reply, I'm catching up on some unfinished emails in 
> my outbox that have been languishing for far too long.

No problem.

> On Mon, 10 Sep 2007, Johann Glaser wrote:
> 
> >>> I suspect the low diff time, which I've now increased to 20.000 
> >>> seconds. :-)
> >>
> >> That may be a bit too long! I wonder how long the diffs take in 
> >> practice? If we're constantly re-reading files then we're certainly 
> >> wasting a lot of time in diffing.
> >
> > Why should it be too long?
> 
> Each file will be diffed for up to 20,000 seconds, nearly 6 hours. If you 
> have a number of large files that really can take 6 hours to diff each, 
> then your backup will run... really... slowly.

Yes, sure, but what happens when the diff is cancelled because of a timeout?
Either BoxBackup assumes the file didn't change or it assumes it has
changed. Both cases are bad. In the first case changes at the end of the
file will not be backed up. In the second case the huge file will be
backed up every day again, even if there were no changes. And the file
will be backed up as a whole, even if just one additional byte was
appended.

> > The re-reading of files is indeed a strange thing. Does this make any 
> > sense?
> 
> Yes, it does kind of make sense. The file on the server can easily have a 
> mix of block sizes, and we need to calculate whether we have matches for 
> any of those blocks. Currently we can only search for matches using one 
> block size at a time, so we have to re-read the file once for each block 
> size (up to 128 times, I think, in the worst case).
> 
> It would be possible to optimise this to calculate checksums and find 
> matches for all block sizes in a single pass, but nobody has done that 
> work yet.

So it is just an implementation issue, not a conceptual one?
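
To make the "one re-read per block size" point concrete, here is a minimal
sketch. It is not Box Backup's actual code: the function name, the simple
additive weak checksum and the SHA-256 strong checksum are all assumptions
chosen for illustration. It only shows why the local data gets scanned once
for every distinct block size found on the server.

```python
import hashlib
from collections import defaultdict


def find_matches_per_block_size(local, server_blocks):
    """Hypothetical sketch: scan `local` once per distinct server block size.

    Returns (matches, passes), where `matches` maps a server block index to
    the offset in `local` where an identical block was found.
    """
    # Group server blocks by size, then by a cheap weak checksum.
    by_size = defaultdict(lambda: defaultdict(list))
    for index, block in enumerate(server_blocks):
        weak = sum(block) & 0xFFFF
        strong = hashlib.sha256(block).digest()
        by_size[len(block)][weak].append((index, strong))

    matches = {}
    passes = 0
    for size, table in by_size.items():
        if size == 0 or size > len(local):
            continue
        passes += 1  # the whole local file is read again for this size
        weak = sum(local[:size]) & 0xFFFF
        offset = 0
        while True:
            # Only compute the expensive checksum when the weak one matches.
            for index, strong in table.get(weak, []):
                if index in matches:
                    continue
                window = local[offset:offset + size]
                if hashlib.sha256(window).digest() == strong:
                    matches[index] = offset
            if offset + size >= len(local):
                break
            # Roll the window forward by one byte: drop one, add one.
            weak = (weak - local[offset] + local[offset + size]) & 0xFFFF
            offset += 1
    return matches, passes


local_file = b"AAAABBBBBBCCxDDDD"
blocks_on_server = [b"AAAA", b"BBBBBB", b"DDDD", b"ZZ"]
matches, passes = find_matches_per_block_size(local_file, blocks_on_server)
```

With three distinct block sizes (4, 6 and 2 bytes) the sketch walks over the
local data three times. A single-pass version would have to keep one rolling
window per block size going at the same time, which would be the optimisation
that nobody has done yet.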

> Noted on [http://www.boxbackup.org/trac/wiki/ConfiguringAServer].

Great, thanks. I've added a little more text to guide users to external
HDDs (USB, FireWire, eSATA).

> > Regarding modifying BoxBackup to tolerate slow storage: Changing its
> > storage format should be avoided. :-)
> 
> But may be necessary, unfortunately. Large files tend to perform better 
> than small ones, especially for streaming and in reducing disk overhead, 
> and if we want to support Amazon S3 then it seems to be the way to go.

Hmm, using the backend data store of BoxBackup via a network (either S3
or NFS) is IMHO a bad idea. It would be better to have the BoxBackup
Server directly at the remote site so that only the BoxBackup protocol
(using diffs) is transmitted via the network.

Bye
  Hansi