[Box Backup] Old vs Deleted files removal

Torsten boxbackup@boxbackup.org
Wed, 22 Jul 2009 16:34:33 +0200


because of my keep-files-28-days patch i tried to understand the logic of=20
Housekeeping and to solve the problem old vs deleted files. Here is my=20

Am Friday 20 February 2009 02:53:45 schrieb S=F8ren D=F8ssing:
> On Thu, Feb 19, 2009 at 8:32 AM, Chris Wilson <chris@qwirx.com> wrote:
> > I think all the action happens in bin/bbstored/HousekeepStoreAccount.cp=
> > It starts in HousekeepStoreAccount::DoHousekeeping(), but that doesn't =
> > much. The real action I suspect is in
> > HousekeepStoreAccount::ScanDirectory() which is called on the root
> > directory and calls itself recursively, building up a "map to count the
> > distance from the mark" as it goes. That's where I get lost, I don't kn=
> > what the "mark" is or what's done with this list yet.
> Chris, thanks for your explanation. I looked inside the source code to
> see if I could find out what was going on. I think I got at least one
> step further towards a reasonable solution.
> In HousekeepStoreAccount.h I find the following explanation of Mark:
>                 int32_t mVersionAgeWithinMark;  // 0 =3D=3D current, 1
> latest old version, etc
> And in HousekeepStoreAccount.cpp I find the follow calculation of Mark:
>                         // Work out ages of this version from the last ma=
>                         int32_t enVersionAge =3D 0;
>                         std::map<std::pair<BackupStoreFilename,
> int32_t>, int32_t>::iterator
> enVersionAgeI(markVersionAges.find(std::pair<BackupStoreFilename,
> int32_t>(en->GetName(), en->GetMarkNumber())));
>                         if(enVersionAgeI !=3D markVersionAges.end())
>                         {
>                                 enVersionAge =3D enVersionAgeI->second + =
>                                 enVersionAgeI->second =3D enVersionAge;
>                         }
>                         else
>                         {
> markVersionAges[std::pair<BackupStoreFilename, int32_t>(en->GetName(),
> en->GetMarkNumber())] =3D enVersionAge;
>                         }
>                         // enVersionAge is now the age of this version.
> The way I read it is; Assume Mark is 0. If other versions of same file
> is found, sort them and set Mark to the sorted position.

The files are not really sorted but only counted how many newer versions of=
this file exists. On every ScanDirectory we work only on files of this=20
directory. We are running in reverse Order (=3D=3D reverse order of object =
So we first take newest files until the oldest available file.

We check every file if it is already in markVersionAges list. If it is not=
than we add it. If we find it, than this file is an older version of the fi=
in the list, so we add 1 to enVersionAgeI->second.

I can not find any code that influences this magic markNumber of a file. So=
made an BOX_INFO to en->GetMarkNumber() and it is always 0. I have about a=
half million files in 3 backup accounts but on every housekeeping and every=
file this MarkVersion is always zero. So i think we can ignore this. I do n=
know what this is for. Does anybody know this?

We write the calculated enVersionAge to the object DelEn here:

			d.mVersionAgeWithinMark =3D enVersionAge;

Later we add this element to a sorted list:


That is the only magic here.

Next step is DeleteFiles. This method runs on mPotentialDeletions until eno=
files are deleted.

The surprising thing here for me is, that files are not deleted at the orde=
oldest first. The compare function for the sorted list mPotentialDeletions=
is "HousekeepStoreAccount::DelEnCompare::operator()". This does not compare=
timestamps but first it compares mVersionAgeWithinMark.

Conclusion: ScanDirectory first deletes files with the highest count of old=
version, oldest first. So i never can guarantee that all changes of last we=
are still stored in the backup. If i change one file very often, so perhaps=
version from yesterday is deleted but older versions from another file is=20
still backed up.=20

And that is the problem of old vs deleted files. Soren found the problem.=20
Deleted files have no old version of a file so VersionAge keeps 0. That mea=
that deleted files are only removed, if old versions are removed and we are=
still not below softlimit.

> I'm speculating that the problem here is that the latest version of a
> deleted file is 0. This would be wrong; the latest backup version of a
> deleted file is not current. It should be 1 to indicate that it's one
> version different than what is on client disk, where the file is now
> gone from.
> A simple change is to bump up the enVersionAge by 1 for deleted files
> to make them comparable in revision to old files.
> *** HousekeepStoreAccount.cpp.orig	2008-01-29 09:58:25.000000000 +0900
> --- HousekeepStoreAccount.cpp	2009-02-20 09:54:55.000000000 +0900
> *************** bool HousekeepStoreAccount::ScanDirector
> *** 392,397 ****
> --- 392,398 ----
>   				markVersionAges[std::pair<BackupStoreFilename,
> int32_t>(en->GetName(), en->GetMarkNumber())] =3D enVersionAge;
>   			}
>   			// enVersionAge is now the age of this version.
> + 			if(enFlags & BackupStoreDirectory::Entry::Flags_Deleted)
> ++enVersionAge;
>   			// Potentially add it to the list if it's deleted, if it's an old
> version or deleted
>   			if((enFlags & (BackupStoreDirectory::Entry::Flags_Deleted |
> BackupStoreDirectory::Entry::Flags_OldVersion)) !=3D 0)
> This is just an idea for a quick fix. It's entirely untested, and
> probably wrong if my assumptions are incorrect. For example, I'm not
> sure if enVersionAge is really counting version numbers or holds age
> of file in seconds. I'm also concerned that enVersionAge might be used
> as index to an array, and incrementing it could go out of bounds for
> the array.
> That said, is anybody able to test and confirm if this approach would
> fix the problem?

So Sorens assumtion is correct. If we increment enVersionAge of DeletedFile=
than they are handled like old versions of files. But i think this is not t=
best solution.

I prefer to delete files depending on their timestamp, oldest first. This a=
would only be a little change. The object DelEn must store the timestamp of=
the file from en->GetModificationTime() and in=20
HousekeepStoreAccount::DelEnCompare::operator() we have to delete compariso=
of mVersionAgeWithinMark and substutude with comparison of the timestamps.


thanks for reading