[Box Backup] Please comment: strong file content checksums for file change detection.
G.
boxbackup@fluffy.co.uk
Tue, 25 Sep 2007 07:07:07 -0700 (PDT)
Hi everyone,
There is an issue with BoxBackup that I'd like to bring up.
Under certain circumstances a file can change its content without changing its size or timestamp.
An applicable example is a fixed-size, encrypted TrueCrypt volume with the "plausible
deniability" feature, which - by design - keeps the volume file's timestamp constant in order
to hide usage patterns. A couple more examples, though their rationale is less clear, are
TortoiseSVN and Microsoft Excel on Windows, both of which I have observed modifying their
payload files without changing timestamps.
Current BoxBackup folder-level and file-level change detection algorithms cannot detect such
modifications, resulting in inconsistent backups. I've already run into this problem: an SVN
repository backed up by BoxBackup turned out to be inconsistent after a restore and had to be
thrown away.
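To make the failure mode concrete, here is a minimal stand-alone sketch (not BoxBackup code) showing that a file's content can change while its size and mtime stay identical, which is exactly the case that (size, mtime)-based change detection misses:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "volume.dat")

with open(path, "wb") as f:
    f.write(b"A" * 1024)                        # original content, 1024 bytes
before = os.stat(path)

# Overwrite with different content of the same size, then put the original
# timestamps back -- effectively what a timestamp-preserving application does.
with open(path, "r+b") as f:
    f.write(b"B" * 1024)
os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))

after = os.stat(path)
print(after.st_size == before.st_size)          # True: size unchanged
print(after.st_mtime_ns == before.st_mtime_ns)  # True: mtime unchanged
```

Any scan that compares only these two stat fields will conclude the file is unmodified.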
There is, of course, the question of whether it is reasonable for a backup product to support such
scenarios. IMHO, a backup system should adapt to whatever backup set it is given and still produce
a consistent backup, rather than forcing its own consistency requirements onto the backup set.
Finally, many serious backup/replication packages out there implement some kind of content-based
change detection: Bacula, rsync, robocopy, if my memory serves me.
In my discussions with Chris about this problem, the following thoughts have come up:
a.) Checking for OS file change notifications. Requires less CPU and disk usage on the client, but
cannot cope with changes that happen while bbackupd is not running.
b.) Where bbackupd runs on a unix system (and filesystem) we could simply check the ctime as well
as, or instead of, the mtime, since applications cannot modify the ctime in any way.
Unfortunately, Windows has nothing similar.
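A quick illustration of option b.) on a POSIX filesystem (assumed environment: Linux with nanosecond stat timestamps): an application can put mtime back with utime(), but that very call updates ctime, so the change still leaves a trace:

```python
import os
import tempfile
import time

path = os.path.join(tempfile.mkdtemp(), "repo.db")
with open(path, "wb") as f:
    f.write(b"original")
before = os.stat(path)

time.sleep(0.05)  # let the clock advance past ctime granularity

with open(path, "r+b") as f:
    f.write(b"modified")                # same size, new content
os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))  # mtime restored

after = os.stat(path)
print(after.st_mtime_ns == before.st_mtime_ns)  # True: mtime looks untouched
print(after.st_ctime_ns != before.st_ctime_ns)  # True: ctime betrays the change
```

So comparing ctime as well as mtime would catch this class of modification on Unix; as noted, Windows offers no equivalent check.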
c.) Include per-file MD5 signatures in the folder-level checksum generation algorithm to detect
changes at folder-scan level. Re-use the per-file MD5 signatures in the file-level checksum
generation algorithm to detect changes at file-scan level. Persist file-level MD5 checksums in a
dynamically-sized file attribute stream (?) to avoid backup store upgrade and migration problems.
However, there would be no direct, verifiable relationship between client-side MD5 checksums and
server-side block indexes.
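The core of option c.) could look something like this hedged sketch (function names are illustrative, not BoxBackup's actual API): fold each file's content MD5 into the folder digest, so a content-only change flips the folder checksum even when sizes and timestamps are unchanged:

```python
import hashlib
import os

def file_md5(path, bufsize=1 << 16):
    """MD5 of a file's content, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.digest()

def folder_checksum(folder):
    """Folder-level digest that depends on file content, not just metadata."""
    h = hashlib.md5()
    for name in sorted(os.listdir(folder)):   # stable order matters
        full = os.path.join(folder, name)
        if os.path.isfile(full):
            h.update(name.encode())
            h.update(file_md5(full))
    return h.hexdigest()
```

The per-file digests computed here are exactly what could be re-used at file-scan level and persisted as a file attribute.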
d.) A server could calculate encrypted-block checksums for each block, and return these together
with the IV for each block to the client. The client could then use the IV to calculate the
checksum for each block locally and check that it still matches. Would require a protocol change,
but would provide end-to-end content verification, thus a guarantee that client-side checksums
reflect actual server-side content. Tricky to implement?
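The shape of option d.) can be sketched as follows; note the toy SHA-256 keystream below is only a stand-in for BoxBackup's real block cipher, which this is NOT, and all names are hypothetical. The server returns (IV, ciphertext checksum) per block; the client, holding the key and plaintext, re-encrypts locally and checks the checksum still matches:

```python
import hashlib

def toy_encrypt(key: bytes, iv: bytes, block: bytes) -> bytes:
    # XOR the block with a hash-derived keystream -- illustration only,
    # standing in for the real cipher used on the wire.
    stream = b""
    counter = 0
    while len(stream) < len(block):
        stream += hashlib.sha256(key + iv + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(block, stream))

def block_checksum(ciphertext: bytes) -> bytes:
    return hashlib.md5(ciphertext).digest()

def client_verify(key, iv, local_block, server_checksum):
    # Recompute the encrypted-block checksum locally and compare it with
    # what the server reported for the stored block.
    return block_checksum(toy_encrypt(key, iv, local_block)) == server_checksum
```

Because the checksum is computed over the ciphertext, a match proves the server-side block really corresponds to the client-side plaintext, which is the end-to-end guarantee described above.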
e.) The folder-level checksum generation algorithm could be augmented with file-level MD5
checksums, allowing a complete folder enumeration for folders that may contain files with
modified content regardless of timestamps. During that enumeration, file content would be
checked against the server content through QueryGetBlockIndexByID() and
CompareFileContentsAgainstBlockIndex(); a kind of dynamic "compare -aq", basically. What's
needed to make this practical, however, is a new server-side command to download a complete
object-id-to-block-index map for a given set of folders all at once, compressed.
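The bulk-map part of option e.) might look roughly like this (every name here is a hypothetical stand-in, not the actual protocol): instead of one round trip per file, the server ships a single compressed object-id-to-block-checksum map for the whole folder set, and the client compares locally:

```python
import hashlib
import json
import zlib

BLOCK_SIZE = 4096

def block_checksums(data: bytes):
    """Per-block MD5 hex digests for one file's content."""
    return [hashlib.md5(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

def build_compressed_map(store):
    # Server side: store maps object id -> file bytes; return one
    # compressed blob covering the whole folder set.
    index = {oid: block_checksums(data) for oid, data in store.items()}
    return zlib.compress(json.dumps(index).encode())

def find_modified(compressed_map, local_files):
    # Client side: decompress once, then compare every local file against
    # the server's block checksums without further round trips.
    index = json.loads(zlib.decompress(compressed_map))
    return [oid for oid, data in local_files.items()
            if block_checksums(data) != index.get(oid)]
```

One download replaces per-file QueryGetBlockIndexByID() calls, which is what would make a full content-comparing enumeration affordable.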
Thoughts? Comments? Anyone else interested in such a feature?
Gary