[Box Backup] Old vs Deleted files removal

Søren Døssing boxbackup@boxbackup.org
Fri, 20 Feb 2009 10:53:45 +0900


On Thu, Feb 19, 2009 at 8:32 AM, Chris Wilson <chris@qwirx.com> wrote:

> I think all the action happens in bin/bbstored/HousekeepStoreAccount.cpp. It
> starts in HousekeepStoreAccount::DoHousekeeping(), but that doesn't do much.
> The real action I suspect is in HousekeepStoreAccount::ScanDirectory() which
> is called on the root directory and calls itself recursively, building up a
> "map to count the distance from the mark" as it goes. That's where I get
> lost, I don't know what the "mark" is or what's done with this list yet.

Chris, thanks for your explanation. I looked inside the source code to
see if I could find out what was going on. I think I got at least one
step further towards a reasonable solution.

In HousekeepStoreAccount.h I find the following explanation of Mark:

                int32_t mVersionAgeWithinMark;  // 0 == current, 1 latest old version, etc

And in HousekeepStoreAccount.cpp I find the following calculation of Mark:

                        // Work out ages of this version from the last mark
                        int32_t enVersionAge = 0;
                        std::map<std::pair<BackupStoreFilename, int32_t>, int32_t>::iterator enVersionAgeI(markVersionAges.find(std::pair<BackupStoreFilename, int32_t>(en->GetName(), en->GetMarkNumber())));
                        if(enVersionAgeI != markVersionAges.end())
                        {
                                enVersionAge = enVersionAgeI->second + 1;
                                enVersionAgeI->second = enVersionAge;
                        }
                        else
                        {
                                markVersionAges[std::pair<BackupStoreFilename, int32_t>(en->GetName(), en->GetMarkNumber())] = enVersionAge;
                        }
                        // enVersionAge is now the age of this version.

The way I read it: start with an age of 0. Each time another version of
the same file (same name and mark number) is found, it gets the next
age, so the versions end up numbered by their position in the scan.
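
A minimal standalone sketch of that counting, to make the reading
concrete. The Entry struct and the VersionAges() function are
hypothetical stand-ins (the real code works on
BackupStoreDirectory::Entry and BackupStoreFilename), and plain
std::string is assumed for the file name. Which version ends up with
age 0 depends on the order in which ScanDirectory() visits the entries,
which the excerpt does not show.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a directory entry; the real code uses
// BackupStoreDirectory::Entry with GetName() and GetMarkNumber().
struct Entry
{
	std::string name;
	int32_t markNumber;
};

// Returns the age given to each entry, in scan order. The first entry
// seen for a (name, mark number) pair gets 0, the next gets 1, and so on.
std::vector<int32_t> VersionAges(const std::vector<Entry> &entries)
{
	std::map<std::pair<std::string, int32_t>, int32_t> markVersionAges;
	std::vector<int32_t> ages;
	for(const Entry &en : entries)
	{
		int32_t enVersionAge = 0;
		std::pair<std::string, int32_t> key(en.name, en.markNumber);
		auto it = markVersionAges.find(key);
		if(it != markVersionAges.end())
		{
			// Seen before: one step further from the mark
			enVersionAge = it->second + 1;
			it->second = enVersionAge;
		}
		else
		{
			// First version seen for this name and mark
			markVersionAges[key] = enVersionAge;
		}
		ages.push_back(enVersionAge);
	}
	return ages;
}
```

With three entries named "a.txt" under mark 0, the sketch hands out
ages 0, 1 and 2 in the order they are scanned; an entry with a
different name or mark number starts again at 0.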

I'm speculating that the problem here is that the latest version of a
deleted file gets age 0. That would be wrong: the latest backed-up
version of a deleted file is not current. It should be 1, to indicate
that it is one version away from what is on the client disk, where the
file no longer exists.

A simple change is to bump up the enVersionAge by 1 for deleted files
to make them comparable in revision to old files.

*** HousekeepStoreAccount.cpp.orig	2008-01-29 09:58:25.000000000 +0900
--- HousekeepStoreAccount.cpp	2009-02-20 09:54:55.000000000 +0900
*************** bool HousekeepStoreAccount::ScanDirector
*** 392,397 ****
--- 392,398 ----
  				markVersionAges[std::pair<BackupStoreFilename, int32_t>(en->GetName(), en->GetMarkNumber())] = enVersionAge;
  			}
  			// enVersionAge is now the age of this version.
+ 			if(enFlags & BackupStoreDirectory::Entry::Flags_Deleted) ++enVersionAge;
  			
  			// Potentially add it to the list if it's deleted, if it's an old version or deleted
  			if((enFlags & (BackupStoreDirectory::Entry::Flags_Deleted | BackupStoreDirectory::Entry::Flags_OldVersion)) != 0)
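
To show what the added line is meant to achieve, here is a small
standalone sketch of the adjustment on its own. The flag values and the
AdjustedVersionAge() helper are hypothetical and exist only for
illustration; the real constants are
BackupStoreDirectory::Entry::Flags_Deleted and Flags_OldVersion, and
their actual values are not assumed here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag values, for illustration only.
enum : int16_t
{
	Flags_OldVersion = 1,
	Flags_Deleted = 2
};

// Age after the proposed adjustment: an entry flagged as deleted is
// pushed one step further away from "current".
int32_t AdjustedVersionAge(int32_t enVersionAge, int16_t enFlags)
{
	if(enFlags & Flags_Deleted) ++enVersionAge;
	return enVersionAge;
}
```

Under these assumptions, the newest version of a deleted file (age 0
before the change) ends up with age 1, the same as the latest old
version of a file that still exists, while entries without the deleted
flag keep their age.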

This is just an idea for a quick fix. It's entirely untested, and
probably wrong if my assumptions are incorrect. For example, I'm not
sure whether enVersionAge really counts versions or holds the age of
the file in seconds. I'm also concerned that enVersionAge might be used
as an index into an array, in which case incrementing it could run past
the end of the array.

That said, is anybody able to test and confirm if this approach would
fix the problem?

Soren