[Box Backup-dev] Proposal: strong file content checksum for BoxBackup file change detection.
G.
boxbackup-dev@fluffy.co.uk
Tue, 29 May 2007 10:04:17 -0700 (PDT)
Hi everyone,
Proposal:
------------------
Strong file content checksum for BoxBackup file change detection.
Problem:
------------------
Under certain circumstances a file can change its content without changing its size or timestamp. One example is a TrueCrypt encrypted fixed-size volume with the "plausible deniability" feature, which keeps the volume file's timestamp constant in order to hide usage patterns.
The current BoxBackup folder-level and file-level change detection algorithms cannot detect such modifications, which results in inconsistent backups.
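To make the problem concrete, here is a minimal Python sketch (not BoxBackup code). It overwrites a file in place with the same number of bytes and then restores the original timestamps, as a timestamp-preserving tool would. The helpers weak_signature and strong_signature are hypothetical stand-ins for an attribute-based check and a content-based check.

```python
import hashlib
import os
import tempfile


def weak_signature(path):
    """Size and modification time only (stand-in for an attribute-based check)."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def strong_signature(path):
    """MD5 of the file content, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo():
    """Return (weak_unchanged, strong_unchanged) after a silent content change."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "volume.bin")
        with open(path, "wb") as fh:
            fh.write(b"A" * 4096)
        st = os.stat(path)
        weak_before = weak_signature(path)
        strong_before = strong_signature(path)

        # Overwrite the content in place with the same number of bytes,
        # then put the original access and modification times back.
        with open(path, "r+b") as fh:
            fh.write(b"B" * 4096)
        os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))

        weak_after = weak_signature(path)
        strong_after = strong_signature(path)
        return weak_before == weak_after, strong_before == strong_after
```

Calling demo() should report that the weak signature is unchanged while the content signature differs, which is exactly the case the proposal is about.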
Solution:
------------------
Include per-file MD5 signatures in the folder-level checksum generation algorithm to detect changes at the folder-scan level. Re-use the per-file MD5 signatures in the file-level checksum generation algorithm to detect changes at the file-scan level. Persist the file-level MD5 checksums in a dynamically-sized file attribute stream (?) to avoid backup store upgrade and migration problems.
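A rough Python sketch of the idea, not the BoxBackup implementation: each file's MD5 is computed once, folded into a folder-level digest together with the name, size and modification time, and also returned so the file-level check can re-use it. The function names and the exact fields mixed into the digest are assumptions made for illustration.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the raw MD5 digest of a file's content."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()


def folder_checksum(folder):
    """Return (folder_digest_hex, {file_name: file_md5_hex}).

    Only regular files directly inside `folder` are considered.
    """
    folder_digest = hashlib.md5()
    per_file = {}
    for name in sorted(os.listdir(folder)):  # sorted for a stable result
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        st = os.stat(path)
        content_md5 = file_md5(path)
        per_file[name] = content_md5.hex()
        # Weak part: name, size, modification time.
        folder_digest.update(os.fsencode(name))
        folder_digest.update(b"\0")
        folder_digest.update(f"{st.st_size}:{st.st_mtime_ns}".encode("ascii"))
        # Strong part: the per-file content signature.
        folder_digest.update(content_md5)
    return folder_digest.hexdigest(), per_file
```

If the folder digest matches the previous scan, nothing in the folder needs a closer look. If it differs, the returned per-file signatures show which files changed without hashing them a second time.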
Applicability:
------------------
The feature should be optional, but available for those who want to be sure of 100% consistent backup snapshots and are willing to sacrifice scanning cycle performance.
Side-Effects:
------------------
The ability to run a very fast bbackupquery compare cycle, utilizing MD5 signatures, as opposed to the current "compare -aq", which needs to download all block-level checksums for all files from a remote server.
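As an illustration of why such a compare could be cheap: with one signature per file, the comparison reduces to a dictionary diff. How the store would expose its signatures is not specified in this proposal, so the `remote` mapping below is purely hypothetical.

```python
def compare_signatures(local, remote):
    """Compare two mappings of relative path -> MD5 hex string.

    `local` would come from a local scan and `remote` from the backup
    store (hypothetical; the proposal does not define that interface).
    """
    common = local.keys() & remote.keys()
    return {
        "changed": sorted(p for p in common if local[p] != remote[p]),
        "local_only": sorted(local.keys() - remote.keys()),
        "remote_only": sorted(remote.keys() - local.keys()),
    }
```

Only one short signature per file has to cross the network, rather than every block-level checksum.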
Prototype:
------------------
A simple simulation that generates folder-level MD5 signatures for all files in all folders to detect content change. It evaluates the raw performance degradation of a file scanning cycle on its own, and the overall backup performance degradation of a file scanning cycle combined with minimal, but significant, network traffic. It assumes low-end network connectivity: an over-the-Internet backup to a server half a world away.
It should be noted that I define "true performance penalty" as a percentage of total backup time, that is, the percentage of time one really sacrifices with respect to an entire backup cycle due to MD5 signature generation.
Hardware:
------------------
* QX6700 quad-core
* 10K Raptors
* RAID1
* Windows Vista 32-bit
* NTFS
Sample Backup:
------------------
* ~2.5GB
* ~15,000 files
* ~3,000 folders
Summary:
------------------
1.) Raw performance degradation for a file scanning cycle.
* no content changes
* Vanilla: weak folder-level checksums examined (attributes, mod time, etc.)
* MD5: strong file-level checksums examined to calculate strong folder-level checksums (attributes, mod time, etc. + content)
* no network traffic
a.) Vanilla:
* runtime: 8 seconds total
b.) MD5:
* runtime: 1 minute, 45 seconds >> 105 seconds total
* penalty: 97 seconds, ~825%
2.) Overall backup performance degradation for a file scanning cycle and minimal, but significant, network traffic.
* changes reported artificially for all folder-level checksums
* Vanilla: weak file-level checksums examined (attributes, mod time, etc.)
* MD5: strong file-level checksums examined (attributes, mod time, etc. + content)
* client/server ListDirectory commands executed for all folders, thus simulating ~20% potential delta
* no file content sent
a.) Vanilla:
* runtime: 5 minutes, 56 seconds >> 356 seconds total
b.) MD5:
* runtime: 8 minutes, 29 seconds >> 509 seconds total
* penalty: 153 seconds, ~42%
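For reference, a small Python sketch of the penalty arithmetic, using the two runtimes from scenario 2. It expresses the extra time both relative to the vanilla run and as a share of the MD5-enabled run's total time; the latter is one reading of the "true performance penalty" defined in the Prototype section. The function name and dictionary keys are made up for illustration.

```python
def penalty(baseline_seconds, md5_seconds):
    """Express the extra time of an MD5-enabled run in three ways."""
    extra = md5_seconds - baseline_seconds
    return {
        "extra_seconds": extra,
        # Extra time relative to the vanilla runtime.
        "percent_of_baseline": 100.0 * extra / baseline_seconds,
        # Extra time as a share of the whole MD5-enabled run.
        "percent_of_total": 100.0 * extra / md5_seconds,
    }


# Scenario 2: 5 min 56 s vanilla versus 8 min 29 s with MD5.
scenario_2 = penalty(356, 509)
```

For these inputs the extra time is 153 seconds, which is roughly 43% of the vanilla runtime and roughly 30% of the total MD5-enabled runtime.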
Conclusions:
------------------
1.) For small-delta, frequent real-time scanning environments, MD5 checking would incur a large penalty of ~500% - ~800%. However, it is important to note that the 800% degradation in question applies to a total scanning time of mere seconds or a couple of minutes, resulting in an overall loss of a few additional minutes of scanning time.
2.) For infrequent scanning environments with excellent network connectivity, MD5 checking would incur a raw penalty of ~30% - ~40%. It should be noted that even without any content change, the BoxBackup client/server ListDirectory commands alone reduce the significance of the MD5 signature generation penalty by an order of magnitude.
3.) For infrequent scanning environments with a massive delta, long large-file diffing times and content upload, network traffic seems to be the most limiting factor. The overall MD5 checking penalty would probably be in the ~10% - ~15% range, or inconsequential. This is a guesstimate, but it is logically consistent and agrees with my earlier prototype experiments. As an example, the ~800% scanning-only penalty amounted to 1.5 minutes. I have the single large-file diffing time-out set to 10 minutes; otherwise network traffic overwhelms the overall backup cycle.
4.) Potential scanning speed improvements could involve weaker, but still reliable signature generation algorithms (MD4, CRC-32, etc.) and parallelism for multi-core systems.
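To illustrate point 4, here is a hedged Python sketch of hashing many files in parallel with a pluggable signature function. It is not BoxBackup code, and the function names are made up. CRC-32 comes from the standard zlib module; MD4 is left out because its availability in hashlib depends on the underlying OpenSSL build.

```python
import hashlib
import zlib
from concurrent.futures import ThreadPoolExecutor


def md5_of(path, chunk_size=1 << 20):
    """MD5 hex digest of a file's content."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def crc32_of(path, chunk_size=1 << 20):
    """CRC-32 of a file's content as 8 hex digits (cheaper, much weaker)."""
    crc = 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return f"{crc:08x}"


def hash_files(paths, hasher=md5_of, workers=None):
    """Hash many files concurrently; returns {path: signature}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each path
        # with its own signature.
        return dict(zip(paths, pool.map(hasher, paths)))
```

In CPython, hashlib releases the global interpreter lock while hashing larger buffers, so a thread pool can keep several cores busy. Whether that helps in practice depends on whether the scan is bound by CPU or by disk.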
---
Thoughts? Anyone else interested in such a feature?
Gary