[Box Backup] Old vs Deleted files removal

Torsten boxbackup@boxbackup.org
Mon, 27 Jul 2009 19:25:22 +0200


Hi,

sorry for the additional spam, but perhaps nobody can see my mail because o=
f=20
threaded view. I would like to have comments ...

Because of my keep-files-28-days patch i tried to understand the logic of=20
Housekeeping and to solve the problem old vs deleted files. Here is my=20
result.

Am Friday 20 February 2009 02:53:45 schrieb S=C3=B8ren D=C3=B8ssing:
> On Thu, Feb 19, 2009 at 8:32 AM, Chris Wilson <chris@qwirx.com> wrote:
> > I think all the action happens in bin/bbstored/HousekeepStoreAccount.cp=
p.
> > It starts in HousekeepStoreAccount::DoHousekeeping(), but that doesn't =
do
> > much. The real action I suspect is in
> > HousekeepStoreAccount::ScanDirectory() which is called on the root
> > directory and calls itself recursively, building up a "map to count the
> > distance from the mark" as it goes. That's where I get lost, I don't kn=
ow
> > what the "mark" is or what's done with this list yet.
>
> Chris, thanks for your explanation. I looked inside the source code to
> see if I could find out what was going on. I think I got at least one
> step further towards a reasonable solution.
>
> In HousekeepStoreAccount.h I find the following explanation of Mark:
>
>                 int32_t mVersionAgeWithinMark;  // 0 =3D=3D current, 1
> latest old version, etc
>
> And in HousekeepStoreAccount.cpp I find the follow calculation of Mark:
>
>                         // Work out ages of this version from the last ma=
rk
>                         int32_t enVersionAge =3D 0;
>                         std::map<std::pair<BackupStoreFilename,
> int32_t>, int32_t>::iterator
> enVersionAgeI(markVersionAges.find(std::pair<BackupStoreFilename,
> int32_t>(en->GetName(), en->GetMarkNumber())));
>                         if(enVersionAgeI !=3D markVersionAges.end())
>                         {
>                                 enVersionAge =3D enVersionAgeI->second + =
1;
>                                 enVersionAgeI->second =3D enVersionAge;
>                         }
>                         else
>                         {
>
> markVersionAges[std::pair<BackupStoreFilename, int32_t>(en->GetName(),
> en->GetMarkNumber())] =3D enVersionAge;
>                         }
>                         // enVersionAge is now the age of this version.
>
> The way I read it is; Assume Mark is 0. If other versions of same file
> is found, sort them and set Mark to the sorted position.

The files are not really sorted but only counted how many newer versions of=
=20
this file exists. On every ScanDirectory we work only on files of this=20
directory. We are running in reverse Order (=3D=3D reverse order of object =
IDs).=20
So we first take newest files until the oldest available file.

We check every file if it is already in markVersionAges list. If it is not=
=20
than we add it. If we find it, than this file is an older version of the fi=
le=20
in the list, so we add 1 to enVersionAgeI->second.

I can not find any code that influences this magic markNumber of a file. So=
 i=20
made an BOX_INFO to en->GetMarkNumber() and it is always 0. I have about a=
=20
half million files in 3 backup accounts but on every housekeeping and every=
=20
file this MarkVersion is always zero. So i think we can ignore this. I do n=
ot=20
know what this is for. Does anybody know this?

We write the calculated enVersionAge to the object DelEn here:

                        d.mVersionAgeWithinMark =3D enVersionAge;

Later we add this element to a sorted list:

                        mPotentialDeletions.insert(d);

That is the only magic here.

Next step is DeleteFiles. This method runs on mPotentialDeletions until eno=
ugh=20
files are deleted.

The surprising thing here for me is, that files are not deleted at the orde=
r=20
oldest first. The compare function for the sorted list mPotentialDeletions=
=20
is "HousekeepStoreAccount::DelEnCompare::operator()". This does not compare=
=20
timestamps but first it compares mVersionAgeWithinMark.

Conclusion: ScanDirectory first deletes files with the highest count of old=
=20
version, oldest first. So i never can guarantee that all changes of last we=
ek=20
are still stored in the backup. If i change one file very often, so perhaps=
=20
version from yesterday is deleted but older versions from another file is=20
still backed up.=20

And that is the problem of old vs deleted files. Soren found the problem.=20
Deleted files have no old version of a file so VersionAge keeps 0. That mea=
ns=20
that deleted files are only removed, if old versions are removed and we are=
=20
still not below softlimit.

> I'm speculating that the problem here is that the latest version of a
> deleted file is 0. This would be wrong; the latest backup version of a
> deleted file is not current. It should be 1 to indicate that it's one
> version different than what is on client disk, where the file is now
> gone from.
>
> A simple change is to bump up the enVersionAge by 1 for deleted files
> to make them comparable in revision to old files.
>
> *** HousekeepStoreAccount.cpp.orig    2008-01-29 09:58:25.000000000 +0900
> --- HousekeepStoreAccount.cpp 2009-02-20 09:54:55.000000000 +0900
> *************** bool HousekeepStoreAccount::ScanDirector
> *** 392,397 ****
> --- 392,398 ----
>                               markVersionAges[std::pair<BackupStoreFilena=
me,
> int32_t>(en->GetName(), en->GetMarkNumber())] =3D enVersionAge;
>                       }
>                       // enVersionAge is now the age of this version.
> +                     if(enFlags &=20
BackupStoreDirectory::Entry::Flags_Deleted)
> ++enVersionAge;
>
>                       // Potentially add it to the list if it's deleted, =
if=20
it's an old
> version or deleted
>                       if((enFlags &=20
(BackupStoreDirectory::Entry::Flags_Deleted |
> BackupStoreDirectory::Entry::Flags_OldVersion)) !=3D 0)
>
> This is just an idea for a quick fix. It's entirely untested, and
> probably wrong if my assumptions are incorrect. For example, I'm not
> sure if enVersionAge is really counting version numbers or holds age
> of file in seconds. I'm also concerned that enVersionAge might be used
> as index to an array, and incrementing it could go out of bounds for
> the array.
>
> That said, is anybody able to test and confirm if this approach would
> fix the problem?

So Sorens assumtion is correct. If we increment enVersionAge of DeletedFile=
s=20
than they are handled like old versions of files. But i think this is not t=
he=20
best solution.

I prefer to delete files depending on their timestamp, oldest first. This a=
lso=20
would only be a little change. The object DelEn must store the timestamp of=
=20
the file from en->GetModificationTime() and in=20
HousekeepStoreAccount::DelEnCompare::operator() we have to delete compariso=
n=20
of mVersionAgeWithinMark and substutude with comparison of the timestamps.

Ideas?

thanks for reading
  Torsten