[Box Backup] BadBackupStoreFile

Johann Glaser boxbackup@fluffy.co.uk
Fri, 31 Aug 2007 10:48:06 +0200


Hi!

> > Yes, there are errors:
> > query > compare -E . .
> > Local file './__db.002/__db.002' has different contents to store file './__db.002'.
> > Local file './__db.003/__db.003' has different contents to store file './__db.003'.
> > Local file './__db.004/__db.004' has different contents to store file './__db.004'.
> > Local file './__db.005/__db.005' has different contents to store file './__db.005'.
> > Local file './__db.006/__db.006' has different contents to store file './__db.006'.
> ...
> > [ 0 (of 5) differences probably due to file modifications after the last upload ]
> > Differences: 5 (0 dirs excluded, 0 files excluded)
> 
> Are these errors expected, i.e. did those files change since the last 
> backup? The message seems to indicate that they did not, and therefore 
> another possible bug in 0.10. But it's also possible that Subversion or 
> BDB manually changes the timestamps on these files, rendering Box Backups' 
> timestamp comparison useless.

Before backing up SVN files I do a hotcopy and only backup these
"exported" databases. So these don't change any more because SVN doesn't
know of its copies.

bbackupquery says the date of the above mentioned files is the same as
the date in the file system.

BTW: bbackupquery gives the time in (i guess) UTC while in the
filesystem we have +2h (Austria, Vienna, DST). E.g. for __db.002
bbackupquery says "2007-08-10T01:12:19" while the filesystem (using "ls
-l" says "2007-08-10 03:12". The same holds for some other files I've
randomly checked.

> > ERROR: (4/48) during file fetch and comparsion for './strings'
> > ERROR: (7/41) during file fetch and comparsion for './transactions'
> > ERROR: (7/41) during file fetch and comparsion for './uuids'
> 
> The 7/41 errors are a symptom of a broken connection (loss of 
> synchronisation) after the comparison for ./strings failed, which is 
> expected (unfortunately). Please could you try to identify the other large 
> file and to compare it separately, to see if you get a 4/48 error? (I'd 
> expect so).

Please see below, I got the same error. I have to mention that all these
tests were accomplished while the daily backup run was active (it
usually lasts from 3:00-

> > In the directory there are some more files which haven't been mentioned
> > in the output above. "strings" is the only large file (7.4GB). All other
> > files in this directory are <55MB.
> 
> Any idea, then, what the other file over 2GB is? (./f2/o44.rfw)

./f2/o44.rfw    is .../linux/db/strings    (7.5GiB -> 5.1GiB)
./16/01/o5f.rfw is .../qm/db/strings       (3.1GiB -> 2.7GiB)
./69/01/o05.rfw is .../docstore/db/strings (2.1GiB -> 1.5GiB)

I started a "compare -E . ." in .../docstore/db/ which prints the
following:
query > compare -E . .
Local file './__db.001' does not exist, but store file './__db.001' does.
Local file './__db.002' does not exist, but store file './__db.002' does.
Local file './__db.003' does not exist, but store file './__db.003' does.
Local file './__db.004' does not exist, but store file './__db.004' does.
Local file './__db.005' does not exist, but store file './__db.005' does.
Local file './__db.006' does not exist, but store file './__db.006' does.
Local file './__db.register' does not exist, but store file './__db.register' does.
Local file './log.0000005287' does not exist, but store file './log.0000005287' does.
Local file './log.0000005288' does not exist, but store file './log.0000005288' does.
ERROR: (4/48) during file fetch and comparsion for './strings'
ERROR: (7/41) during file fetch and comparsion for './transactions'
ERROR: (7/41) during file fetch and comparsion for './uuids'

[ 0 (of 9) differences probably due to file modifications after the last
upload ]
Differences: 9 (0 dirs excluded, 0 files excluded)

The files which don't exist really don't exist (any more?, probably the
currently running backup didn't come across this directory yet). The
three Error lines were printed with several minutes delay between them.

> > PID 28114 was still running with nearly 100% CPU when already at the 
> > bbackupquery prompt. Typing "ls" just hang. I had to kill it, just 
> > restarting the boxbackup-server didn't stop this task.
> 
> That's really bad, sorry. Can you reproduce this?

Now (after the above error 4/48 not when starting bbackupquery it prints

Box Backup Query Tool v0.10, (c) Ben Summers and contributors 2003-2006
Using configuration file /etc/boxbackup/bbackupd.conf
Connecting to store...
Handshake with store...
Login to store...

and hangs.

> The cases that you mention should not cause an exception to be thrown, but 
> rather a recoverable error condition. If you think that they are aborting 
> the backup, then I'd really appreciate your help to find out why.

The problem is that I can't check wbether such conditions stop the
backup or not. That was why I've written my concerns a few emails
before.

> What Box Backup documentation have you read so far? Do you have an idea 
> where the best place to document this would be, so that you would have 
> found it if it existed?

The main documentation was an installation guide at a German Linux news
site (http://www.pro-linux.de/berichte/boxbackup.html). I've also
visited the BoxBackup main site and its Wiki. I'm not really sure but I
remember there was no explicit documentation about logging, so I'd
suggest to write such a section.

> >> The path name is converted to ID by taking the two hex digits from each 
> >> component, reversing the order (most significant byte is the last one, 
> >> before ".rfw") and padding with zeroes on the left. So, for example, 
> >> ./cc/02/o1f.rfw is 001f02cc. (I think that's right anyway).
> >
> > I found its a bit more complicated. For two levels above mentioned file 
> > ./f2/o44.rfw belongs to the ID 0000f244. For three levels the 
> > translation is ./xx/yy/ozz.rfw -> ID=00yyxxzz.
> 
> OK, sorry, you learn something every day :-)

Could you please provide a tool (e.g. as a new command for bbackupquery)
to do this translation in both directions? I wanted to write this to the
feature request Wiki but didn't have an account.

> > For some nearly-equally-sized files in the backup store I found that
> > they belong to the very same file on the client, so they represent
> > different versions. When looking with bbackupquery at old versions, all
> > of them have similar large size.
> ...
> > Unfortunately we backup on an external storage (Iomega StorCenter 150D)
> > connected with NFS over 100MBit/s which is _extremely_ slow, especially
> > for directory listings. So such timeouts might well be the problem.
> >
> > I found this option in bbackupd.conf. Which unit is used for the time?
> > Seconds? Milliseconds?
> 
> Units are seconds. Where did you look for this information, i.e. where 
> should we improve the documentation?

I'd suggest directly in the config file comments.

Currently the value is 20 sec. which is really small. I assume this time
is the wall clock time from starting the diff until giving up, including
all delays while the daemon waits for the transfer of the backup store
file, reading the real file, ...

I suggest to measure this time only as real CPU usage for the diff
algorithm, so that the data transfer delays, ... are excluded. Or even
better: make it configurable, e.g. by introducing a second config
variable called MaximumDiffingCpuTime which is checked as mentioned. If
one of the two variables is set to 0 no limit is applied, of both are
set to >0, the limit which is reached first is applied.

A note about the diff algorithm: The "diff" tool provided by my Linux
distribution wants to load both files into memory and then does the
diff. This is bad for large files due to memory consumption. Usually the
differences of files are either huge or tiny, so it should be enough to
have a several MB sliding window. I ask to use such an algorithm instead
of the whole-file-approach (if it is not already done) for Box Backup.

Bye
  Hansi