[Box Backup] New encryption format & Space usage

Ben Summers boxbackup@fluffy.co.uk
Thu, 29 Apr 2004 15:00:32 +0100


On 29 Apr 2004, at 14:04, rprice@freeshell.org wrote:

>>
>> * Use AES for file data
>>    - After long consideration, I think AES is a better choice of
>> cipher for file data, mainly due to the larger block size. Blowfish
>> is still used for meta data (as it's more space efficient with its
>> 64 bit block size), and Blowfish encoded files can still be restored.
>
> Hi Ben,
>
> How does it move the data forward?
>
> To try and explain:
>
> Does it convert the data when the client next connects, at server
> startup time, or does it just let the old (encryption) format data
> expire?
>
> After re-reading your email it looks like it lets the old data expire,
> but I was just wondering.

Files are made up of blocks, from about 4k -- 64k depending on file 
size. Each block is compressed if it is over 256 bytes long. Then it is 
encrypted. Each block has a 1 byte header saying whether it is 
compressed, and which encryption algorithm is used. (This was always 
the case.)

For the AES change, I've just added a new algorithm, and blocks are 
encoded with it by default (with a separate key). So new blocks 
(remember the diffs!) are AES, old ones are Blowfish, and the store 
will gradually move over to AES as time goes on. Files which have been 
patched may contain blocks using both encryption methods.
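
If it helps to picture how that hangs together, here's a toy C++ 
sketch of the idea. The header layout, flag bits, cipher IDs and names 
below are all invented for illustration -- they're not the real 
on-disk format, just the shape of it:

   #include <cstddef>
   #include <cstdint>
   #include <cstdio>

   // Toy illustration of a 1 byte block header: a compression flag
   // plus a cipher ID. The real layout and values are not these.
   const uint8_t FLAG_COMPRESSED = 0x80;   // hypothetical flag bit
   const uint8_t CIPHER_MASK     = 0x0f;
   const uint8_t CIPHER_BLOWFISH = 0x01;   // hypothetical cipher IDs
   const uint8_t CIPHER_AES      = 0x02;

   // Writing side: blocks over 256 bytes are compressed first, and
   // new blocks get the AES ID while existing ones keep Blowfish.
   uint8_t MakeHeader(std::size_t rawSize, uint8_t cipher)
   {
       uint8_t h = cipher;
       if(rawSize > 256) h |= FLAG_COMPRESSED;
       return h;
   }

   // Restore side: the cipher is chosen per block, which is why a
   // patched file can contain a mix of Blowfish and AES blocks.
   const char *CipherName(uint8_t header)
   {
       switch(header & CIPHER_MASK)
       {
           case CIPHER_BLOWFISH: return "blowfish";
           case CIPHER_AES:      return "aes";
           default:              return "unknown";
       }
   }

   int main()
   {
       uint8_t h = MakeHeader(4096, CIPHER_AES);
       printf("compressed=%d cipher=%s\n",
           (h & FLAG_COMPRESSED) != 0, CipherName(h));
       return 0;
   }

The point is just that the cipher is recorded per block rather than 
per file, so mixed files are perfectly fine.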

There's another change regarding the block indices. There was a bug 
relating to endianness. To fix this, there's a new version of the file 
format, which is almost exactly the same. With backwards compatibility 
enabled, the old and new formats are parsed successfully. But you can't 
diff from old to new.
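
For anyone wondering what this kind of endianness bug looks like, here 
is a generic example (nothing to do with the actual index format): the 
broken version reads a multi-byte field through whatever byte order 
the host uses, while the portable version assembles it in the order 
the file format defines.

   #include <cstdint>
   #include <cstdio>
   #include <cstring>

   // Broken on some machines: interprets the raw bytes through
   // whatever byte order the host happens to use.
   uint32_t ReadNative(const uint8_t *p)
   {
       uint32_t v;
       std::memcpy(&v, p, sizeof(v));
       return v;
   }

   // Portable: the on-disk format defines the byte order, so every
   // machine reconstructs the same value.
   uint32_t ReadBigEndian(const uint8_t *p)
   {
       return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16)
            | (uint32_t(p[2]) << 8)  |  uint32_t(p[3]);
   }

   int main()
   {
       // 256 stored big-endian
       const uint8_t onDisk[4] = {0x00, 0x00, 0x01, 0x00};
       printf("native read: %u, defined order: %u\n",
           (unsigned)ReadNative(onDisk), (unsigned)ReadBigEndian(onDisk));
       return 0;
   }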

Then the attributes: changes are now detected using a hash rather than 
the attribute modification time, and the hash is stored in the same 
place the attribute mod time used to be. Since the stored value won't 
match, attributes will be re-uploaded for each file the first time the 
client sees a directory.
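
Roughly speaking -- and this is a toy sketch with an invented hash and 
an invented choice of fields, not what the client really does -- the 
idea looks like this:

   #include <cstddef>
   #include <cstdint>
   #include <cstdio>
   #include <sys/stat.h>

   // FNV-1a, standing in for whatever hash the client actually uses.
   uint64_t HashBytes(const void *data, std::size_t len,
       uint64_t h = 14695981039346656037ULL)
   {
       const uint8_t *p = static_cast<const uint8_t *>(data);
       for(std::size_t i = 0; i < len; ++i)
       {
           h ^= p[i];
           h *= 1099511628211ULL;
       }
       return h;
   }

   // Hash the attributes being tracked; if the result differs from
   // the value stored on the server (in the slot the attribute mod
   // time used to occupy), the attributes get uploaded again.
   uint64_t AttributeHash(const struct stat &st)
   {
       uint64_t h = HashBytes(&st.st_mode, sizeof(st.st_mode));
       h = HashBytes(&st.st_uid, sizeof(st.st_uid), h);
       h = HashBytes(&st.st_gid, sizeof(st.st_gid), h);
       return h;
   }

   int main()
   {
       struct stat st;
       if(::stat("/etc/hosts", &st) == 0)
       {
           printf("attribute hash: %016llx\n",
               (unsigned long long)AttributeHash(st));
       }
       return 0;
   }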

Hope that explains it.

>
>
> ----
>
> One thing that's annoying me lately (and it's not your fault of
> course) is trying to find out what exactly I'm backing up when the
> daemon runs. I tried the option in the config file that should tell
> you what's going on, but what I was really hoping for was a list of
> files (and their sizes) that were being backed up.

Extended logging will give you some hints.

>
> The problem behind this is that I see some pretty huge numbers for
> the amount of data backed up sometimes (and even for uploaded), and I
> feel that my machine should have been *quiescent* before the large
> amount of data was backed up.

In the default configs, a change is only uploaded once it's at least 6 
hours old. So the machine would have to have been quiescent for over 6 
hours before a backup run would upload nothing.
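
As a sketch of that rule only (the six hour figure is the default, and 
the real decision in bbackupd takes more into account than this):

   #include <cstdio>
   #include <ctime>

   // The idea only, not the real bbackupd code: a changed file isn't
   // uploaded until the change is at least this old (6 hours by
   // default).
   const time_t MIN_FILE_AGE = 6 * 60 * 60;

   bool OldEnoughToUpload(time_t modTime, time_t now)
   {
       return (now - modTime) >= MIN_FILE_AGE;
   }

   int main()
   {
       time_t now = time(0);
       printf("modified 1 hour ago:  %d\n",
           (int)OldEnoughToUpload(now - 1 * 3600, now));
       printf("modified 7 hours ago: %d\n",
           (int)OldEnoughToUpload(now - 7 * 3600, now));
       return 0;
   }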

>
> I also just hit the wall with the amount of data on your store, I
> think yesterday, and I don't think that I should be backing up enough
> stuff to cause that. I'm not saying I think your program is wrong, I'm
> just trying to figure out where the fat file(s) is/are so I can decide
> if I really want it/them backed up. (The script to notify me of the
> problem also had an error (Debian Testing), but I haven't had time to
> look at that yet.)

The config script asks you to test, and possibly change, that 
notification script -- but let me know if you have a patch for it.

>
> I started looking around last night, and I seemed to have some rather
> large files in my wife's email related folders, 'caught spam' (which 
> only
> I check) where she gets a couple of hundred spam a day, and the files
> related to spamassassin. So I'm thinking they were the culprit, it may 
> be
> that the file gets some big changes which spook the delta algorithm.

Did you just upgrade the client to 0.05PLUS1B? It won't diff if the old 
version of the file was written before you upgraded.

>
> I'm thinking of using *find* to try and hunt down the large files in
> the directories,

This is probably the best approach.

>  but, I guess even some stats in the syslog would be helpful, i.e.
> number of files backed up, maybe even some size stats (smallest,
> largest, average - for file size and delta size), so I could try and
> decide if there were 1000 small files that were changed, or one huge
> file that somehow was changed.

I'll put some more stats in.

>
> This sort of thing would be helpful for someone trying to manage their
> data on a third party data store (as you are for me). I'd be
> particularly interested in doing, say, trending analysis to find out
> what was happening, and to look for ominous changes. Or even to
> predict when I'd hit the wall, and need to *purchase* more space from
> the data store provider.

In bbackupquery, type "usage" for space usage on the server. (New in 
0.05PLUS1B.) Or

   /usr/local/bin/bbackupquery usage quit

as a simple one liner.

(I've just increased your allocation a bit, BTW)

>
> Part of my interest is also related to bandwidth usage. I'm on an ADSL
> connection, but some of my friends who would be backing up to me would
> be on dialup. Being able to manage the bandwidth usage would be
> helpful, and knowing what sort of changes were happening would help
> that way (you could try and have multiple backup sets or know *why* a
> huge upload happened). Other possible options would be deferring large
> deltas until a set time, while letting small changes through. Although
> that could be risky and complicated.

I'm designing this system with ADSL as the minimum requirement. That's 
not to say it couldn't be used over dial-up; it's just not designed 
for it.

Ben