[Box Backup-dev] Future work

Sat, 25 Feb 2006 13:11:41 +0000

On Fri, 2006-02-24 at 14:37 +0000, Ben Summers wrote:
> I have written up some notes about how I'd like to see the project  
> develop:
> 
>     http://boxbackup.hostworks.ca/index.php/0.20_redesign
> 
> Comments welcomed!

File selection
- Agreed

Upload engine 
- Agreed

Streams in files 
- I prefer properly supporting multiple streams, although I'm not
against the multiple files idea if we can demonstrate it is
significantly simpler.

Attributes 
- Agreed

Store 
- Agreed

Directory structure 
        I would like to completely change the way directories are
        stored. This allows 1) the directories to be signed and 2) to
        have a lovely way of going back in time.
- Both good

        The directory is a hierarchy, as now, with a directory referring
        to subdirectories. However, we have a new object, which is a
        "backup set". This specifies a root directory, and a list of
        grafts.
        A graft is a pair of (original dir ID, replacement dir ID). So
        when you are scanning a tree, if you see a reference to the
        original dir ID, you use the replacement one instead. This
        allows efficient replacement of leaves of the tree -- otherwise
        we would have to duplicate it from the root.
- Ok
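
A rough sketch of the graft lookup as I read it, in Python for brevity.
The names (BackupSet, resolve, walk) and the dict-of-children layout are
made up for illustration and are not Box Backup code. It assumes a graft
is applied once per reference and that grafts are not chained.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class BackupSet:
    """Hypothetical backup set: a root directory plus a list of grafts."""
    root_id: int
    # Maps original dir ID -> replacement dir ID.
    grafts: Dict[int, int] = field(default_factory=dict)

    def resolve(self, dir_id: int) -> int:
        # Use the replacement if this ID has been grafted, else the ID itself.
        return self.grafts.get(dir_id, dir_id)


def walk(backup_set: BackupSet,
         directories: Dict[int, List[int]],
         dir_id: Optional[int] = None) -> Iterator[int]:
    """Yield the directory IDs actually visited when scanning the tree."""
    if dir_id is None:
        dir_id = backup_set.root_id
    actual = backup_set.resolve(dir_id)
    yield actual
    # Children are read from the directory we actually landed on.
    for child in directories[actual]:
        yield from walk(backup_set, directories, child)


# Directory 1 refers to subdirectories 2 and 3; 30 is a newer copy of 3.
directories = {1: [2, 3], 2: [], 3: [], 30: []}
old_set = BackupSet(root_id=1)
new_set = BackupSet(root_id=1, grafts={3: 30})

old_view = list(walk(old_set, directories))
new_view = list(walk(new_set, directories))
```

The point is that the new set replaces leaf 3 without rewriting
directory 1 or the root, and the old set still sees the old tree.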

        Filenames are no longer encrypted as separate objects, as this
        was shown to be a pain and annoyingly inefficient. Instead, the
        directory is stored as a single encrypted stream. This is
        possible as the store no longer needs to modify them. However,
        the referred object IDs need to be stored in the clear so the
        server can use them, but of course, these are included in the
        signed data so the server can't modify them without detection.
- Mostly agreed, though I'd be concerned about the use case where a
directory with many entries has small changes made to it. Think news
spool, Maildir directories, etc. Ideally we wouldn't have to re-upload
the whole thing on every change.

        Attributes are only stored in the directory, never in the file.
        (Although, what about xattrs, which could be "big"? But
        relevant, because they can store ACLs.)
- Attributes include things such as mtime and size? Presumably we don't
back up atime? I'd be concerned that we'd upload the whole directory
object just for one frequently changing file.

        Each time bbackupd connects, it makes a new backup set, marked
        with the current time. Rules are specified to the server (or
        client?) to say when a backup set can be automatically deleted.
        By default bbackupquery will use the latest backup set, but can
        be instructed to use a different one.
- Don't really like this. Doesn't sound like it works well for lazy
mode, which you confirmed below.

Lazy mode 
        This was the only mode the original supported. It has its pros
        and cons. The original decision was made to ensure a slow
        trickle of data across a broadband connection.
        The above may make supporting lazy mode a bit tricky, and we
        should decide whether we want to keep it, and if so, in what
        form. For example, directory entries may need to be marked as
        "changed but not included", prompting the restore to look in
        future versions.
- Lazy mode is essential for me. It's my favourite feature of Box and is
what makes it ideal for backing up over a DSL link. With the addition of
inotify (or equivalent file notification support) and restart resume it
will finally work perfectly. I can't see the point in going to all the
trouble of adding inotify support if we then just end up batching it up
at the other end of bbackupd instead!

Rather than making it messy, I think we'd be better off coming up with a
design whereby lazy mode worked very well. I was thinking of something
like your grafts, but each would have a time range, i.e. when you fetch
an object it has multiple possibilities, each with a valid-from/valid-to
range.

You could go into the store and enter any time as view time and then for
every object you saw you'd get the version valid at that moment.

Snapshot mode would be identical, with the addition that every time it
did an upload it would record the time in a list. Then it would be
simple to select the view time from the list of snapshots.
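
To make the idea concrete, a minimal Python sketch of the lookup. The
names (Version, version_at, latest_snapshot_at) are hypothetical, times
are plain numbers, and I'm assuming ranges are half-open (valid from
inclusive, valid to exclusive) with None meaning "still current".

```python
import bisect
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Version:
    """One possible version of an object, valid over a time range."""
    object_id: int
    valid_from: float
    valid_to: Optional[float] = None  # None means still current


def version_at(versions: List[Version], view_time: float) -> Optional[Version]:
    """Return the version valid at view_time, or None if there was none."""
    for v in versions:
        if v.valid_from <= view_time and (v.valid_to is None
                                          or view_time < v.valid_to):
            return v
    return None


def latest_snapshot_at(snapshots: List[float],
                       view_time: float) -> Optional[float]:
    """Snapshot mode: pick the last recorded upload time <= view_time.

    Assumes snapshots is sorted in ascending order.
    """
    i = bisect.bisect_right(snapshots, view_time)
    return snapshots[i - 1] if i else None


# Lazy mode: any view time works directly against the ranges.
versions = [Version(object_id=10, valid_from=100, valid_to=200),
            Version(object_id=11, valid_from=200)]
seen = version_at(versions, 150)

# Snapshot mode: same store, but view times come from the recorded list.
snapshots = [100, 200, 300]
view_time = latest_snapshot_at(snapshots, 250)
```

Both modes share the same store layout; snapshot mode only adds the
list of upload times to choose a view time from.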

> Also, there's a mini-project suitable for one developer to do  
> independently of everything else, making the raidfile support better  
> and efficient in a cluster of three store servers:
> 
>    http://boxbackup.hostworks.ca/index.php/Raidfile_improvements
> 
> Anyone up for it?

I'm not convinced of the advantage of building in RAID support. I prefer
to back up to two stores on different machines rather than use RAID.
Obviously for companies providing the service RAID would be well worth
having, but RAID is so widely available these days that you can be
fairly sure they've got it already.

Maybe the three-server cluster is a bonus, but there might be a much
less complicated way of achieving that than using RAID.

What I'm up for:
- inotify support in Linux, as I already stated. Hence I'd be happy to
join in with that section of redesign to abstract the file searching
interface out. If we decide this would also benefit from a db backend
then I guess I'll help on that too.

- Switching to ostreams to remove the hundreds of warnings and increase
robustness. This will probably also include changing the logging as
Chris has already suggested. Ideally this should be in 0.11.

- Making the underlying box libraries into shared libs so that other
projects (e.g. boxi) can use them easily without having to bastardise the
box source. Also other possible build improvements, such as supporting
PREFIX properly.

- Changing the store to allow simple point-in-time retrieval, as
discussed above. This could be quite a big job though and like everyone
else I don't have a whole lot of time, but I'll try and help out with
this one.

Cheers,

Martin.