[Box Backup] Deleting old snapshots

Ben Summers boxbackup@fluffy.co.uk
Thu, 7 Apr 2005 10:53:13 +0100


On 6 Apr 2005, at 00:20, Richard Wallace wrote:

> Per Thomsen wrote:
>> On 3/29/05 11:48 AM, Richard Wallace wrote:
>>> Hey guys,
>>>
>>> Been using Box Backup in snapshot mode for a while now and I'm 
>>> pretty happy with it.  Our backups are getting a little large and we 
>>> really only want to keep the data from the last 45 days.  Is there a 
>>> way to remove snapshots older than a certain time?
>> Rich,
>> Snapshots are not really snapshots in the traditional sense of the 
>> word, when you use Box. A snapshot just means that you decided to 
>> sync your disk to the backup store at a given time. So, you don't 
>> really have a physical set of files that are snapshotted, but rather 
>> a set of initial files, and patches applied to those files reflecting 
>> the changes between snapshots (it's a little more complicated than 
>> that, but that's the basic gist of it).
>
> I understand there isn't an actual set of files that makes up the 
> snapshot, it's really a set of deltas at the time the snapshot was 
> taken.  But you do bring up something else that I've been curious 
> about.  Each snapshot is really a delta.  But what is it a delta of?  
> Is it a delta from the initial (full) backup?  Or is it from the last 
> delta that was done of the file?

Currently, directories contain chains of deltas. If a client decides to 
upload a patch instead of a new file, that patch is applied to the 
relevant file, and that generated full version is stored as the latest 
version in the directory. Then, the patch is reversed (so it's a patch 
from the latest version to the previous version) and the previous 
(full) version is replaced with this patch.

So you end up with the latest version in a chain always being a full 
version, and a set of patches backwards. When you fetch an old version, 
patches are applied to build it for download.
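
To illustrate the layout described above, here is a minimal Python sketch of a reverse delta chain. It is not Box Backup's code: the names (make_patch, apply_patch, ReverseChain) and the copy/data patch format are assumptions made for illustration, encryption is ignored, and the reverse patch is produced by re-diffing the two full versions rather than by reversing the uploaded patch.

```python
from difflib import SequenceMatcher

def make_patch(src: bytes, dst: bytes):
    """Build a patch turning src into dst: copy ranges of src plus literal data."""
    ops = []
    matcher = SequenceMatcher(None, src, dst, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif j2 > j1:  # "replace" or "insert": carry the new bytes literally
            ops.append(("data", dst[j1:j2]))
        # "delete" emits nothing: those bytes simply do not appear in dst
    return ops

def apply_patch(src: bytes, patch) -> bytes:
    """Rebuild the target version from src and a patch."""
    out = bytearray()
    for op in patch:
        out += src[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)

class ReverseChain:
    """Newest version stored in full; older versions stored as backward patches."""

    def __init__(self, first: bytes):
        self.latest = first
        self.back = []  # back[k] turns version k+1 into version k

    def upload(self, forward_patch):
        """Client sends a patch against the latest version."""
        new = apply_patch(self.latest, forward_patch)
        # The old full version is replaced by a patch from new to old.
        self.back.append(make_patch(new, self.latest))
        self.latest = new

    def fetch(self, version: int) -> bytes:
        """Walk backwards from the newest version to the requested one."""
        data = self.latest
        for patch in reversed(self.back[version:]):
            data = apply_patch(data, patch)
        return data

chain = ReverseChain(b"hello world")
chain.upload(make_patch(chain.latest, b"hello brave world"))
chain.upload(make_patch(chain.latest, b"goodbye brave world"))
print(chain.fetch(0), chain.fetch(1), chain.latest)
```

Fetching the latest version applies no patches at all, while fetching version 0 applies every backward patch in the chain.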

Housekeeping can remove any version in a chain, and rebuilds patches as 
appropriate -- so any file in the directory can be removed at any time 
while still maintaining the chain. Patches are combined properly too; 
it's not simply a matter of concatenating the lists of changes.
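
As a rough illustration of what combining two patches can involve, the Python sketch below composes two patches into one so that the intermediate version can be dropped. The copy/data patch format and the function names are hypothetical and are not taken from Box Backup.

```python
def apply_patch(src: bytes, patch) -> bytes:
    """Apply a list of ("copy", start, end) / ("data", bytes) ops to src."""
    out = bytearray()
    for op in patch:
        out += src[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)

def compose(first, second):
    """Return one patch equivalent to applying `first`, then `second`."""
    # Record where each op of `first` lands in the intermediate version.
    layout = []
    pos = 0
    for op in first:
        length = op[2] - op[1] if op[0] == "copy" else len(op[1])
        layout.append((pos, pos + length, op))
        pos += length

    out = []
    for op in second:
        if op[0] == "data":
            out.append(op)  # literal data needs no translation
            continue
        lo, hi = op[1], op[2]  # range of the intermediate version to copy
        for start, end, src_op in layout:
            a, b = max(lo, start), min(hi, end)
            if a >= b:
                continue  # this piece of `first` does not overlap the range
            if src_op[0] == "copy":
                base = src_op[1]
                out.append(("copy", base + (a - start), base + (b - start)))
            else:
                out.append(("data", src_op[1][a - start:b - start]))
    return out

newest = b"abcdefgh"
to_middle = [("copy", 0, 4), ("data", b"XY"), ("copy", 6, 8)]  # newest -> middle
to_oldest = [("copy", 2, 7), ("data", b"!")]                   # middle -> oldest

direct = compose(to_middle, to_oldest)                         # newest -> oldest
assert apply_patch(newest, direct) == apply_patch(apply_patch(newest, to_middle), to_oldest)
```

Each copy range in the second patch is translated back into ranges of the newest version (or into literal data), so the combined patch no longer depends on the removed middle version.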

>
> I can see advantages/disadvantages to both ways.  With only taking 
> deltas from the initial backup you can get rid of any intermediary 
> deltas at any time.  But the deltas will increase in size as time goes 
> by and the file in question changes.  With the delta based on the last 
> delta method you can't really ever get rid of any of the snapshots, 
> but each one is relatively small.

If you change a file by a constant amount each time, then each patch in 
the chain should be a constant size.

This scheme gives reasonable access times for old versions, no penalty 
for the latest version, and storage efficiency, all while still 
maintaining encryption.

>
> Both methods have the drawback that as time goes by the amount of 
> space needed to actually store the deltas will grow.  The only way to 
> solve that would be to "roll-forward" that initial backup.  That is, 
> when a user says "I don't want any snapshots older than x days," the 
> deltas up to that point in time are applied to the initial (full) 
> backup and it's brought up to date.  Of course, this only works if all 
> deltas are based on previous snapshots and not the initial full 
> backup.  But then you have people that say, "I want weekly, monthly 
> and yearly snapshots."  So even that method won't work for all.
>
> Sorry if this is getting a little long, but this is something I've 
> been thinking about quite a bit lately and I'm just curious how you 
> guys have approached the problem.

The current version stores all the info within directories, and it's 
those that essentially manage when files are removed. It's not working 
well for everyone, and doesn't support all the funky things that users 
are asking for.

In the version which supports snapshots, I plan to alter the way the 
server works. The chains of patches will be kept, but will be managed 
outside the directory structure. Also, the responsibility for managing 
the directories will lie with the client so that they can be signed 
(and also reduce the amount of churn on the server when uploading lots 
of files into a directory.)

Each snapshot will have a separate set of directories (simplification: 
directories which are unchanged will not be duplicated). Directories 
and files will all be reference counted. Deleting a directory will 
reduce the reference count of other directories and files, and the 
housekeeping will delete the objects when they reach zero. (So no 
time-consuming scanning either.)

Anyway... this means that you'll have multiple root directories to 
choose from, each relating to a snapshot in time. Don't want it any 
more? Just delete it, the reference counts will be updated lazily, and 
everything will just work out.
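
The reference counting idea can be sketched in a few lines of Python. This is only an illustration of the planned design as described here, not code from any Box Backup version; the Store class, its method names and the object ids are all made up.

```python
from collections import deque

class Store:
    """Toy object store: snapshot roots, directories and files share one id space."""

    def __init__(self):
        self.children = {}      # object id -> ids it refers to
        self.refs = {}          # object id -> number of referrers
        self.pending = deque()  # objects at zero refs, awaiting housekeeping

    def add(self, obj_id, children=()):
        """Add a file or directory; its children must already exist."""
        self.children[obj_id] = list(children)
        self.refs.setdefault(obj_id, 0)
        for child in children:
            self.refs[child] += 1

    def add_root(self, obj_id, children=()):
        """Add a snapshot root; the snapshot itself counts as one reference."""
        self.add(obj_id, children)
        self.refs[obj_id] += 1

    def delete_root(self, obj_id):
        """Cheap: only the root's own count changes now."""
        self._release(obj_id)

    def _release(self, obj_id):
        self.refs[obj_id] -= 1
        if self.refs[obj_id] == 0:
            self.pending.append(obj_id)

    def housekeep(self):
        """Lazily remove dead objects, releasing whatever they referred to."""
        removed = []
        while self.pending:
            obj_id = self.pending.popleft()
            for child in self.children.pop(obj_id):
                self._release(child)
            del self.refs[obj_id]
            removed.append(obj_id)
        return removed

store = Store()
for name in ("file_a", "file_b", "file_b2"):
    store.add(name)
store.add("docs", ["file_a"])                 # unchanged directory, stored once
store.add_root("snap1", ["docs", "file_b"])
store.add_root("snap2", ["docs", "file_b2"])  # shares "docs" with snap1

store.delete_root("snap1")
print(store.housekeep())
```

Deleting snap1 removes only the objects nothing else refers to; the shared "docs" directory survives because snap2 still references it, and no scan of the whole store is needed.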

So snapshots will "just work", and the management side for implementing 
rules will be trivial. Maybe I'll implement it as a 10 line shell 
script? Also, directories can be signed, further protecting your 
data.
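
As an example of how small such a rule could be, here is a Python sketch of the "nothing older than 45 days" policy from earlier in the thread. It only selects which snapshots to drop; the snapshot names and the way snapshots would be listed or deleted are assumptions, since no such interface exists yet.

```python
from datetime import datetime, timedelta

def expired_snapshots(snapshots, now, keep_days=45):
    """Return names of snapshots taken more than keep_days before now, oldest first."""
    cutoff = now - timedelta(days=keep_days)
    old = [(taken, name) for name, taken in snapshots.items() if taken < cutoff]
    return [name for taken, name in sorted(old)]

# Hypothetical snapshot roots and the times they were taken.
snapshots = {
    "snap-2005-01-15": datetime(2005, 1, 15),
    "snap-2005-02-10": datetime(2005, 2, 10),
    "snap-2005-03-30": datetime(2005, 3, 30),
}

for name in expired_snapshots(snapshots, now=datetime(2005, 4, 7)):
    print("would delete", name)
```

Rules such as "keep weekly, monthly and yearly snapshots" would only change the selection function; deletion stays the same.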

Hope that made sense.

As always, I have some lovely ideas, but less time to write them. It's 
annoying having to earn a living.

>
>> The way that Box works, the best way (IMHO) is to figure out the size 
>> of the backup set that you want, and then adjust your storage 
>> allocation to match that.
>> So, if you find that 45 days of backups takes up ~30GB of data, I 
>> would issue the following command:
>> bbstoreaccounts setlimit XXXXX 30G 31G
>> to set your limit to 30G with a 1G buffer.
>
> The problem with this is that you will have to continually monitor and 
> update that limit.  I mean, I've seen businesses that have their data 
> grow many gigs a week.  With a 'remove snapshots (or deltas) older 
> than x days' you only ever have to worry about keeping enough disk 
> space available.

Yes, that's not good.

Ben