[Box Backup] Thoughts on reliability (Was: "Box Backup - corrupted store")
Ben Summers
boxbackup@fluffy.co.uk
Mon, 13 Jun 2005 10:50:56 +0100
On 11 Jun 2005, at 10:59, Gary wrote:
> Ben,
>
>> I suspect that in this case of corruption, a block got corrupted
>> somehow after it was committed.
>
>> Berkeley DB is only used on the client to track non-essential data.
>
> My bad, I have not looked at the server sources that deeply yet. At
> any rate, how does the server side guarantee consistency in the case
> of an interrupted upload or an interrupted server-side file write
> (especially for large files)?
I use three techniques, which I believe are all standard good
practice for software engineers.
1) Write to temp files, then move into place to commit. Everything
does this, as part of the lib/raidfile code. Since moving a file over
another should "always" be atomic, either all of a change happens or
none of it does. (There's a sketch of this pattern after point 3.)
2) Careful ordering of operations, plus exception handlers and
destructors which clean up by default. If there's a failure while
processing a request, exception handlers and object destructors will
clear it up. By ordering the code carefully, everything can be undone
until the last minute. In addition, if my assumption that the "moves
into place" which turn the temp files into live files always work
turns out to be incorrect, the exception handlers will recover. (The
same sketch below shows the cleanup-by-destructor idea.)
3) Fault tolerance -- expect things to go wrong. If something could
have gone wrong in the past, the code will be tolerant of the mess it
leaves behind and correct it as it goes along. The best example is
the "store info file", which contains details of the space used and
the last ID allocated. This should be written to disc after every
request, but that would be vastly inefficient, so it's written lazily
every few requests. If the server child process terminates
unexpectedly, the file will be out of date. So the code which
allocates a new ID assumes that the recorded value may be stale, and
the housekeeping routine will correct out-of-date space information.
(A second sketch below illustrates this.)
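To illustrate (1) and (2) together, here's a minimal sketch of the
commit-by-rename pattern with a cleanup-by-destructor guard. This is
not the actual lib/raidfile code -- the class and function names are
invented -- but it shows the shape of the technique:

    #include <cstdio>      // std::rename, std::remove
    #include <fstream>
    #include <stdexcept>
    #include <string>

    // RAII guard: if we leave scope before Commit() succeeds, the
    // destructor removes the temporary file, so a failed request
    // leaves the previous version of the file untouched.
    class TempFileTransaction
    {
    public:
        explicit TempFileTransaction(const std::string &rFinalName)
            : mFinal(rFinalName), mTemp(rFinalName + ".tmp"),
              mCommitted(false)
        {
        }
        ~TempFileTransaction()
        {
            if(!mCommitted)
            {
                std::remove(mTemp.c_str()); // clean up on failure paths
            }
        }
        const std::string &GetTempName() const { return mTemp; }
        void Commit()
        {
            // Renaming over the old file should be atomic (on POSIX
            // filesystems), so the final name always refers to either
            // the complete old file or the complete new one.
            if(std::rename(mTemp.c_str(), mFinal.c_str()) != 0)
            {
                throw std::runtime_error("rename failed");
            }
            mCommitted = true;
        }
    private:
        std::string mFinal;
        std::string mTemp;
        bool mCommitted;
    };

    void WriteFileAtomically(const std::string &rName,
        const std::string &rData)
    {
        TempFileTransaction transaction(rName);
        {
            std::ofstream out(transaction.GetTempName().c_str(),
                std::ios::binary | std::ios::trunc);
            out.write(rData.data(), rData.size());
            // If the write failed, throwing here lets the guard's
            // destructor delete the temp file; the old version survives.
            if(!out.good())
            {
                throw std::runtime_error("write failed");
            }
        } // stream closed (and flushed) before the rename
        transaction.Commit();
    }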
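And a sketch of the attitude behind (3): treat the lazily-written
"last ID" as a hint rather than the truth. Again, the names and the
filename scheme here are hypothetical, not the real server code:

    #include <cstdint>
    #include <cstdio>
    #include <sys/stat.h>

    // Hypothetical helper: does an object with this ID already exist
    // on disc? Stubbed here as a stat() on a made-up filename scheme.
    bool ObjectExists(int64_t ObjectID)
    {
        char name[64];
        std::snprintf(name, sizeof(name), "store/o%016llx",
            (unsigned long long)ObjectID);
        struct stat st;
        return ::stat(name, &st) == 0;
    }

    // Allocate a new object ID, assuming the store info's record of
    // the last ID handed out may be stale after an unclean shutdown:
    // skip forward past any IDs which were used but never recorded.
    int64_t AllocateObjectID(int64_t &rLastObjectIDInStoreInfo)
    {
        int64_t id = rLastObjectIDInStoreInfo;
        do
        {
            ++id;
        }
        while(ObjectExists(id));      // tolerate a stale store info file
        rLastObjectIDInStoreInfo = id; // written back to disc lazily
        return id;
    }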
>
>> It compares MD5 checksums of each block with the index. Using a hash
>> is good enough for digital signatures, so I hope it's good enough for
>> this.
>
>> I suppose an MD5 of the encrypted data could be kept as well, and the
>> server could verify that the file is still OK?
>
> There seem to be two separate issues here: comparing server index
> content with client content (-q), and verifying that server index
> content is actually what is on a server hard drive. The first issue
> has already been solved by comparing current client content blocks
> with last known uploaded blocks on the server (index), right?
Yes.
> The second issue is currently not implemented, but, if I understand
> correctly, if a function is added to make sure that the server-side
> index is actually what's currently on the server hard drive, then a
> complete end-to-end compare has been achieved.
>
> I guess we would need some kind of "association" of reported client
> blocks (index content) with the actual server content.
You would have to trust the server to report correct results. A
checksum of the encrypted data could be kept for each block, or,
perhaps more space-efficiently, a single checksum for the entire
store file. There may be issues with calculating this when patches
are in use, so it's not as simple as it might seem. (A sketch of the
whole-file version follows.)
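Purely as an illustration of the whole-file variant (this is not
existing Box Backup code), the server-side checksum might look
something like:

    #include <openssl/md5.h>
    #include <cstdio>
    #include <fstream>
    #include <stdexcept>
    #include <string>

    // Compute the MD5 of an entire store file, so a client could
    // compare it against a digest recorded at upload time.
    std::string MD5OfFile(const std::string &rFilename)
    {
        std::ifstream in(rFilename.c_str(), std::ios::binary);
        if(!in)
        {
            throw std::runtime_error("cannot open " + rFilename);
        }

        MD5_CTX ctx;
        MD5_Init(&ctx);

        char buffer[4096];
        while(in.read(buffer, sizeof(buffer)) || in.gcount() > 0)
        {
            MD5_Update(&ctx, buffer, (size_t)in.gcount());
        }

        unsigned char digest[MD5_DIGEST_LENGTH];
        MD5_Final(digest, &ctx);

        // Format as the usual hex string
        std::string hex;
        char byte[3];
        for(int i = 0; i < MD5_DIGEST_LENGTH; ++i)
        {
            std::sprintf(byte, "%02x", digest[i]);
            hex += byte;
        }
        return hex;
    }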
>
> Did I get this right?
The principle, I think, is right, although the implementation may be
tricky. I would prefer a challenge-response system, where the client
challenges the server with some data, and the server replies with a
value which could only be calculated if it held a proper copy of the
file. This may not be practical, of course. (A sketch of one possible
shape for this follows.)
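For what it's worth, one possible shape for such a scheme (just a
sketch, not anything implemented): the client sends a fresh random
nonce, and the server must return MD5(nonce + file contents). Because
the nonce is new each time, the server cannot cache or precompute the
answer without actually holding the file bytes.

    #include <openssl/md5.h>
    #include <fstream>
    #include <string>

    // Server side: answer a challenge by hashing the client's nonce
    // together with the stored (encrypted) file. Only a server which
    // really holds the file can produce the right digest for a nonce
    // it has never seen before.
    std::string AnswerChallenge(const std::string &rNonce,
        const std::string &rStoreFilename)
    {
        MD5_CTX ctx;
        MD5_Init(&ctx);
        MD5_Update(&ctx, rNonce.data(), rNonce.size());

        std::ifstream in(rStoreFilename.c_str(), std::ios::binary);
        char buffer[4096];
        while(in.read(buffer, sizeof(buffer)) || in.gcount() > 0)
        {
            MD5_Update(&ctx, buffer, (size_t)in.gcount());
        }

        unsigned char digest[MD5_DIGEST_LENGTH];
        MD5_Final(digest, &ctx);
        return std::string((const char *)digest, MD5_DIGEST_LENGTH);
    }

    // The client verifies by computing the same digest over its own
    // reconstruction of the encrypted file -- which is the impractical
    // part, since it must be able to reproduce those bytes exactly.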
Ben