[Box Backup] Old vs Deleted files removal
Mon, 27 Jul 2009 19:25:22 +0200
Sorry for the additional spam, but perhaps nobody can see my mail because of the
threaded view. I would like to have comments ...
Because of my keep-files-28-days patch I tried to understand the logic of
Housekeeping and to solve the problem of old vs deleted files. Here is my analysis.
On Friday 20 February 2009 02:53:45, Søren Døssing wrote:
> On Thu, Feb 19, 2009 at 8:32 AM, Chris Wilson <firstname.lastname@example.org> wrote:
> > I think all the action happens in bin/bbstored/HousekeepStoreAccount.cpp.
> > It starts in HousekeepStoreAccount::DoHousekeeping(), but that doesn't do
> > much. The real action I suspect is in
> > HousekeepStoreAccount::ScanDirectory() which is called on the root
> > directory and calls itself recursively, building up a "map to count the
> > distance from the mark" as it goes. That's where I get lost, I don't know
> > what the "mark" is or what's done with this list yet.
> Chris, thanks for your explanation. I looked inside the source code to
> see if I could find out what was going on. I think I got at least one
> step further towards a reasonable solution.
> In HousekeepStoreAccount.h I find the following explanation of Mark:
> int32_t mVersionAgeWithinMark; // 0 == current, 1 latest old version, etc
> And in HousekeepStoreAccount.cpp I find the following calculation of Mark:
> // Work out ages of this version from the last mark
> int32_t enVersionAge = 0;
> std::map<std::pair<BackupStoreFilename, int32_t>, int32_t>::iterator
>     enVersionAgeI(markVersionAges.find(std::pair<BackupStoreFilename,
>     int32_t>(en->GetName(), en->GetMarkNumber())));
> if(enVersionAgeI != markVersionAges.end())
> {
>     enVersionAge = enVersionAgeI->second + 1;
>     enVersionAgeI->second = enVersionAge;
> }
> else
> {
>     markVersionAges[std::pair<BackupStoreFilename, int32_t>(en->GetName(),
>         en->GetMarkNumber())] = enVersionAge;
> }
> // enVersionAge is now the age of this version.
> The way I read it is: assume Mark is 0. If other versions of the same file
> are found, sort them and set Mark to the sorted position.
The files are not really sorted; we only count how many newer versions of
each file exist. On every ScanDirectory we work only on the files of this
directory. We run through them in reverse order (== reverse order of object ...),
so we go from the newest file to the oldest available one.
We check for every file whether it is already in the markVersionAges list. If it is
not, then we add it. If we find it, then this file is an older version of the file
in the list, so we add 1 to enVersionAgeI->second.
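
To make the counting concrete, here is a minimal standalone sketch of how I read it. Entry and CountVersionAges are names invented for this illustration, and a plain std::string stands in for BackupStoreFilename; it is not the real bbstored code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a directory entry (the real code uses
// BackupStoreDirectory::Entry and BackupStoreFilename).
struct Entry {
    std::string name;
    int32_t markNumber;
};

// Walk the entries from newest to oldest and return, for each entry, how many
// newer versions with the same (name, mark number) were seen before it.
// The input is in directory order (oldest first), so we iterate in reverse.
std::vector<int32_t> CountVersionAges(const std::vector<Entry>& entriesOldestFirst)
{
    std::map<std::pair<std::string, int32_t>, int32_t> markVersionAges;
    std::vector<int32_t> ages(entriesOldestFirst.size(), 0);
    for (std::size_t i = entriesOldestFirst.size(); i-- > 0;) {
        const Entry& en = entriesOldestFirst[i];
        const std::pair<std::string, int32_t> key(en.name, en.markNumber);
        int32_t enVersionAge = 0;
        auto found = markVersionAges.find(key);
        if (found != markVersionAges.end()) {
            // Seen before: this entry is one version older than the last one.
            assert(found->second >= 0);
            enVersionAge = found->second + 1;
            found->second = enVersionAge;
        } else {
            // First (newest) occurrence of this name: age 0.
            markVersionAges[key] = enVersionAge;
        }
        ages[i] = enVersionAge;
    }
    return ages;
}
```

For a directory holding a.txt, b.txt, a.txt, a.txt (oldest first, all with mark number 0) this gives the ages 2, 0, 1, 0: the newest a.txt is 0 and each older a.txt is one higher.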
I cannot find any code that influences this magic mark number of a file. So I
added a BOX_INFO for en->GetMarkNumber() and it is always 0. I have about
half a million files in 3 backup accounts, but on every housekeeping run and for
every file this mark number is always zero. So I think we can ignore it. I do not
know what it is for. Does anybody know?
We write the calculated enVersionAge to the object DelEn here:
d.mVersionAgeWithinMark = enVersionAge;
Later we add this element to the sorted list mPotentialDeletions.
That is the only magic here.
The next step is DeleteFiles. This method runs over mPotentialDeletions until enough
files are deleted.
The surprising thing for me here is that files are not deleted in the order
oldest first. The compare function for the sorted list mPotentialDeletions
is "HousekeepStoreAccount::DelEnCompare::operator()". It does not compare
timestamps; first it compares mVersionAgeWithinMark.
Conclusion: housekeeping first deletes the versions with the highest version age,
oldest first. So I can never guarantee that all changes of the last week
are still stored in the backup. If I change one file very often, then perhaps the
version from yesterday is deleted while older versions of another file are
still backed up.
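
A small sketch of that effect, under assumptions: DelEn is reduced to three fields, and I assume the tie-break after the version age is the object ID (lower ID == uploaded earlier). The field names and DeletionOrder are made up for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Simplified, hypothetical version of DelEn: only what the example needs.
struct DelEn {
    std::string name;              // for illustration only
    int64_t objectID;              // assumed to grow with upload order
    int32_t versionAgeWithinMark;  // 0 == current, 1 == latest old version, ...
};

// Assumed ordering: highest version age first, then lowest object ID.
struct DelEnCompare {
    bool operator()(const DelEn& x, const DelEn& y) const
    {
        if (x.versionAgeWithinMark != y.versionAgeWithinMark) {
            return x.versionAgeWithinMark > y.versionAgeWithinMark;
        }
        return x.objectID < y.objectID;
    }
};

// Return the object IDs in the order in which they would be deleted.
std::vector<int64_t> DeletionOrder(const std::vector<DelEn>& candidates)
{
    std::set<DelEn, DelEnCompare> potentialDeletions(candidates.begin(),
                                                     candidates.end());
    // The set drops duplicates, so this relies on unique object IDs.
    assert(potentialDeletions.size() == candidates.size());
    std::vector<int64_t> order;
    for (const DelEn& d : potentialDeletions) {
        order.push_back(d.objectID);
    }
    return order;
}
```

With one old version of rare.doc (object 10, age 1) and three old versions of busy.log (objects 20, 30, 40 with ages 3, 2, 1), the order is 20, 30, 10, 40. If housekeeping only needs to delete two objects, the two recent busy.log versions go and the much older rare.doc version stays.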
And that is the problem of old vs deleted files. Søren found the problem.
Deleted files have no old version, so their version age stays 0. That means
that deleted files are only removed once the old versions have been removed and
we are still not below the soft limit.
> I'm speculating that the problem here is that the latest version of a
> deleted file is 0. This would be wrong; the latest backup version of a
> deleted file is not current. It should be 1 to indicate that it's one
> version different than what is on client disk, where the file is now
> gone from.
> A simple change is to bump up the enVersionAge by 1 for deleted files
> to make them comparable in revision to old files.
> *** HousekeepStoreAccount.cpp.orig 2008-01-29 09:58:25.000000000 +0900
> --- HousekeepStoreAccount.cpp 2009-02-20 09:54:55.000000000 +0900
> *************** bool HousekeepStoreAccount::ScanDirector
> *** 392,397 ****
> --- 392,398 ----
> int32_t>(en->GetName(), en->GetMarkNumber())] = enVersionAge;
> // enVersionAge is now the age of this version.
> + if(enFlags &
> // Potentially add it to the list if it's deleted, it's an old
> version or deleted
> if((enFlags &
> BackupStoreDirectory::Entry::Flags_OldVersion)) != 0)
> This is just an idea for a quick fix. It's entirely untested, and
> probably wrong if my assumptions are incorrect. For example, I'm not
> sure if enVersionAge is really counting version numbers or holds age
> of file in seconds. I'm also concerned that enVersionAge might be used
> as index to an array, and incrementing it could go out of bounds for
> the array.
> That said, is anybody able to test and confirm if this approach would
> fix the problem?
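
The added line of the patch is cut off in the quote above, so the following is only my reading of the idea, not the patch itself: I assume it tests a "deleted" flag and bumps the age by one. The flag values and EffectiveVersionAge are placeholders for this sketch; the real flags are defined in BackupStoreDirectory::Entry.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder flag values, assumed for this sketch only.
constexpr int16_t Flags_Deleted = 4;
constexpr int16_t Flags_OldVersion = 8;

// Version age as used for sorting, with the proposed bump: an entry marked
// as deleted counts as one version older than the counting alone says.
int32_t EffectiveVersionAge(int32_t countedAge, int16_t enFlags)
{
    assert(countedAge >= 0);
    if ((enFlags & Flags_Deleted) != 0) {
        return countedAge + 1;
    }
    return countedAge;
}
```

With this, the only version of a deleted file gets age 1 instead of 0, the same as the latest old version of a file that still exists.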
So Søren's assumption is correct. If we increment enVersionAge for deleted files,
then they are handled like old versions of files. But I think this is not the
best solution.
I prefer to delete files depending on their timestamp, oldest first. This also
would only be a small change. The object DelEn must store the timestamp of
the file from en->GetModificationTime(), and in
HousekeepStoreAccount::DelEnCompare::operator() we have to remove the comparison
of mVersionAgeWithinMark and substitute a comparison of the timestamps.
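
Roughly what I have in mind, as an untested sketch: TimedDelEn, its modificationTime field and DelEnCompareByTime are hypothetical names, and I assume the timestamp can be held in a 64-bit integer. The object ID is only kept as a tie-break so that the ordering stays strict.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical DelEn variant that carries the modification time.
struct TimedDelEn {
    int64_t objectID;
    uint64_t modificationTime;  // assumed to come from en->GetModificationTime()
};

// Oldest timestamp first; the object ID breaks ties.
struct DelEnCompareByTime {
    bool operator()(const TimedDelEn& x, const TimedDelEn& y) const
    {
        if (x.modificationTime != y.modificationTime) {
            return x.modificationTime < y.modificationTime;
        }
        return x.objectID < y.objectID;
    }
};

// Return the object IDs in the order in which they would be deleted.
std::vector<int64_t> DeletionOrderByTime(const std::vector<TimedDelEn>& candidates)
{
    std::set<TimedDelEn, DelEnCompareByTime> potentialDeletions(
        candidates.begin(), candidates.end());
    // The set drops duplicates, so this relies on unique object IDs.
    assert(potentialDeletions.size() == candidates.size());
    std::vector<int64_t> order;
    for (const TimedDelEn& d : potentialDeletions) {
        order.push_back(d.objectID);
    }
    return order;
}
```

Old versions and deleted files then compete only on their timestamps, so neither the number of versions of a file nor the deleted flag changes the order.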
thanks for reading