[Box Backup] Thoughts on reliability (Was: "Box Backup - corrupted store")

Ben Summers boxbackup@fluffy.co.uk
Tue, 21 Jun 2005 16:07:35 +0100


On 21 Jun 2005, at 10:54, Gary wrote:

> Ben,
>
>
>> 1) Write to temp files, then move into place to commit. Everything
>> 2) Careful ordering of operations, exception handlers and destructors
>> 3) Fault tolerance -- expect things to go wrong. If something could
>>
>
> Sounds like a good plan, but considering the fact that a backup of
> electronic property is often one of the most valuable company assets,
> I think employing a transactional file system (through an API)
Which one did you have in mind? (I suppose SVN is a possibility.)
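
For context, point 1 above already gives atomic commits at the level
of a single file. A minimal sketch of that write-then-rename pattern,
illustrative only and not the actual store code:

#include <stdio.h>
#include <stdexcept>
#include <string>
#include <unistd.h>

// Illustrative sketch of point 1's write-then-rename commit; not
// the actual lib/raidfile code. Data is written to a temporary
// file, flushed to disc, then atomically renamed into place.
void AtomicWrite(const std::string &rFinalPath, const void *pData,
    size_t length)
{
    std::string tempPath(rFinalPath + ".tmp");

    FILE *f = fopen(tempPath.c_str(), "wb");
    if(f == NULL)
    {
        throw std::runtime_error("cannot open temporary file");
    }

    if(fwrite(pData, 1, length, f) != length
        || fflush(f) != 0
        || fsync(fileno(f)) != 0)
    {
        fclose(f);
        unlink(tempPath.c_str());
        throw std::runtime_error("write failed");
    }
    fclose(f);

    // rename() within one filesystem is atomic: readers see either
    // the old file or the complete new one, never a partial write.
    if(rename(tempPath.c_str(), rFinalPath.c_str()) != 0)
    {
        unlink(tempPath.c_str());
        throw std::runtime_error("commit failed");
    }
}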

> would be the best way to go. I would be happy to look into this, but
> are you planning to change the backend code substantially in the next
> version? Would that work be, effectively, "lost"?

Possibly. It would depend on how it was done. Since the server does
everything through the interface provided by lib/raidfile, and that
interface certainly wouldn't be changed, you could modify the
implementation behind it. Then it's just a matter of adding begin
and end transaction calls to the code, which could potentially be
added as part of the autogenerated protocol code.
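
As a hypothetical sketch of what those calls might look like (the
TransactionalStore interface below is made up, not lib/raidfile's
real API), an RAII guard can pair the begin and end calls so that an
exception rolls the transaction back automatically:

#include <stdexcept>

// Made-up interface for illustration only.
class TransactionalStore
{
public:
    virtual ~TransactionalStore() { }
    virtual void BeginTransaction() = 0;
    virtual void CommitTransaction() = 0;
    virtual void AbortTransaction() = 0;
};

// Uses point 2's "exception handlers and destructors" rule: if the
// stack unwinds before Commit() is called, the destructor aborts
// the transaction.
class TransactionGuard
{
public:
    TransactionGuard(TransactionalStore &rStore)
        : mrStore(rStore),
          mCommitted(false)
    {
        mrStore.BeginTransaction();
    }
    ~TransactionGuard()
    {
        if(!mCommitted)
        {
            mrStore.AbortTransaction();
        }
    }
    void Commit()
    {
        mrStore.CommitTransaction();
        mCommitted = true;
    }
private:
    TransactionalStore &mrStore;
    bool mCommitted;
};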

I would suggest discussing any plans before commencing work.

>
>
>> every few requests. If the server child process terminates
>> unexpectedly, it will be out of date. So the code which allocates a
>> new ID assumes that it may be out of date, and the housekeeping
>> routine will correct out of date space information.
>>
>
> Actually, that one gave me quite a run-around a few times, until I
> proved to myself that it was harmless. Oh, well, I guess I am the
> extremely paranoid type :).

For backups, paranoia is justified.
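
To illustrate that tolerance concretely: the allocator can treat the
recorded last object ID as a hint rather than the truth. A rough,
hypothetical sketch, not the actual Box Backup code:

#include <set>
#include <stdint.h>

// Hypothetical: the "last ID" read from disc may be stale if a
// child process died before flushing it, so skip over any IDs
// that are already in use instead of trusting the counter.
class ObjectIDAllocator
{
public:
    ObjectIDAllocator(int64_t lastIDOnDisc)
        : mNextID(lastIDOnDisc + 1)
    {
    }

    int64_t AllocateID(const std::set<int64_t> &rExistingIDs)
    {
        while(rExistingIDs.count(mNextID) > 0)
        {
            ++mNextID;
        }
        return mNextID++;
    }

private:
    int64_t mNextID;
};

Housekeeping can then rewrite the counter and the space accounting
from what is actually on disc, which is the correction step
described above.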

>
>
>> You would have to trust the server to report correct results. Either
>> a checksum of the encrypted data could be added, or, perhaps more
>> space-efficiently, a single checksum for the entire store file.
>> There may be issues with calculating this when patches are
>>
>
> Exactly. If a server-content checksum is block-based and does not
> "care" about the content of a file, then it should work fine - all we
> need to verify is that what's stored on the server hard drive is
> exactly what has been uploaded and corresponds to a set of client
> blocks.

Yes. But it should be calculated by the server, and perhaps confirmed
by the client; otherwise we start uploading and downloading
unnecessary data.
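
As a sketch of that: checksum the store file's raw encrypted bytes
without interpreting them. The helper below is made up for
illustration (it uses OpenSSL's MD5, which Box Backup already links
against, but it is not part of the store format):

#include <openssl/md5.h>
#include <stdio.h>
#include <stdexcept>
#include <string>

// Checksum a store file's bytes as-is, without "caring" about
// the content, so it works on encrypted data.
std::string ChecksumStoreFile(const std::string &rPath)
{
    FILE *f = fopen(rPath.c_str(), "rb");
    if(f == NULL)
    {
        throw std::runtime_error("cannot open store file");
    }

    MD5_CTX ctx;
    MD5_Init(&ctx);

    unsigned char buffer[4096];
    size_t bytes;
    while((bytes = fread(buffer, 1, sizeof(buffer), f)) > 0)
    {
        MD5_Update(&ctx, buffer, bytes);
    }
    fclose(f);

    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5_Final(digest, &ctx);

    // Render as hex for comparison with the client's value.
    static const char hexChars[] = "0123456789abcdef";
    std::string hex;
    for(int i = 0; i < MD5_DIGEST_LENGTH; ++i)
    {
        hex += hexChars[digest[i] >> 4];
        hex += hexChars[digest[i] & 0xf];
    }
    return hex;
}

The server would compute and publish this per file; the client
recomputes it over the blocks it uploaded and raises the alarm on a
mismatch.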

>
>
>> I would prefer a challenge-response system, where the client
>> challenges a server with some data, and it replies with some more
>> which could only be calculated if the server held a proper copy of
>> the file.
>>
>
> I beg to differ here; I think it would be much more valuable to
> confirm server content integrity without any access to client data.
> A backup server admin should not have to rely on his or her users
> reporting server corruption, but should be able to cron a
> verification process every, say, 12 hours, and get notified
> immediately over e-mail/beeper/SMS if something has gone wrong with
> the store (which might require another client backup ASAP).

There's no reason there shouldn't be an additional mechanism for the
server to verify integrity. My focus in the project is to allow a
client to back up to a server in which it has minimal trust. Any
trust over and above trusting it to make your data available in the
future is undesirable.
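
For reference, the challenge-response idea could look something like
this. The helper below is made up, not part of the real protocol; it
uses SHA-1 via OpenSSL:

#include <openssl/sha.h>
#include <stdio.h>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical server-side handler: the correct digest can only
// be produced by reading a full, intact copy of the file.
std::vector<unsigned char> ProveFileHeld(const std::string &rPath,
    const std::vector<unsigned char> &rChallenge)
{
    if(rChallenge.empty())
    {
        throw std::runtime_error("empty challenge");
    }

    FILE *f = fopen(rPath.c_str(), "rb");
    if(f == NULL)
    {
        throw std::runtime_error("store file not available");
    }

    SHA_CTX ctx;
    SHA1_Init(&ctx);

    // Hash the client's random challenge first, so the server
    // cannot answer from a cached digest of the file alone.
    SHA1_Update(&ctx, &rChallenge[0], rChallenge.size());

    unsigned char buffer[4096];
    size_t bytes;
    while((bytes = fread(buffer, 1, sizeof(buffer), f)) > 0)
    {
        SHA1_Update(&ctx, buffer, bytes);
    }
    fclose(f);

    std::vector<unsigned char> digest(SHA_DIGEST_LENGTH);
    SHA1_Final(&digest[0], &ctx);
    return digest;
}

The client, which holds the original data, calculates the same
digest locally and compares; a fresh random challenge each time
stops the server replaying old answers.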

On 21 Jun 2005, at 11:03, Gary wrote:
>
> One more thought: in another thread you mention that large-file
> compare runs of well over 5 minutes could indicate a bug in the
> Cygwin/Win32 code (both distributions behave the same, thus
> requiring the unlimited diffing time modification I have done).
> Could you be more specific? I would be very happy to see those
> high-CPU-utilization compare runs shortened.

My comment was made because I would not expect this to happen in the
circumstances mentioned (if I remember correctly). However, it might
not indicate a bug. In any case, someone is rewriting the core
algorithm to make it faster, so I would hold off looking at this for
the moment.

Ben