[Box Backup] BadBackupStoreFile

Johann Glaser boxbackup@fluffy.co.uk
Wed, 29 Aug 2007 14:33:32 +0200


Am Mittwoch, den 29.08.2007, 09:53 +0100 schrieb Chris Wilson:
> Hi Johann,
> 
> > We have two >2GB files in the backup store. See the bottom of a sorted
> > listing:
> > [...]
> > 748651521 ./cc/02/o1f.rfw
> > 748679825 ./72/03/o8e.rfw
> > 748705025 ./3c/03/oa7.rfw
> > 759350104 ./20/04/o10.rfw
> > 759356296 ./23/06/o12.rfw
> > 759552073 ./9e/06/o8f.rfw
> > 759633897 ./1a/07/obc.rfw
> > 1529736909 ./ba/o55.rfw
> > 1539937214 ./69/01/o05.rfw
> > 2744679666 ./16/01/o5f.rfw
> > 5133609317 ./f2/o44.rfw
> 
> Do you have any errors restoring or comparing the other large file?

Yes, there are errors:
query > compare -E . .
Local file './__db.002/__db.002' has different contents to store file './__db.002'.
Local file './__db.003/__db.003' has different contents to store file './__db.003'.
Local file './__db.004/__db.004' has different contents to store file './__db.004'.
Local file './__db.005/__db.005' has different contents to store file './__db.005'.
Local file './__db.006/__db.006' has different contents to store file './__db.006'.
ERROR: (4/48) during file fetch and comparsion for './strings'
ERROR: (7/41) during file fetch and comparsion for './transactions'
ERROR: (7/41) during file fetch and comparsion for './uuids'

[ 0 (of 5) differences probably due to file modifications after the last upload ]
Differences: 5 (0 dirs excluded, 0 files excluded)

In the directory there are some more files which haven't been mentioned
in the output above. "strings" is the only large file (7.4GB). All other
files in this directory are <55MB. In syslog I got the following errors:

Aug 29 11:59:58 clara bbstored[1703]: Incoming connection from 128.131.80.105 port 40500 (handling in child 25231)
Aug 29 11:59:58 clara bbstored[25231]: Certificate CN: BACKUP-1 
Aug 29 11:59:58 clara bbstored[25231]: Login: Client ID 00000001, Read-only
Aug 29 12:04:16 clara bbstored[25231]: Connection statistics for BACKUP-1: IN=1062 OUT=399083 TOTAL=400145 
Aug 29 12:04:16 clara bbstored[25231]: in server child, exception Connection TLSWriteFailed (Probably a network issue between client and server.) (7/33) -- terminating child
Aug 29 12:05:24 clara bbstored[1703]: Incoming connection from 128.131.80.105 port 39024 (handling in child 28114)
Aug 29 12:05:24 clara bbstored[28114]: Certificate CN: BACKUP-1 
Aug 29 12:05:24 clara bbstored[28114]: Login: Client ID 00000001, Read-only
Aug 29 12:12:44 clara bbstored[1703]: Incoming connection from 128.131.80.105 port 49901 (handling in child 29146)
Aug 29 12:12:44 clara bbstored[29146]: Certificate CN: BACKUP-1 
Aug 29 12:12:44 clara bbstored[29146]: Login: Client ID 00000001, Read-only
Aug 29 12:21:40 clara bbstored[29146]: Session finished
Aug 29 12:21:40 clara bbstored[29146]: Connection statistics for BACKUP-1: IN=992 OUT=1565396 TOTAL=1566388 
Aug 29 12:23:00 clara bbstored[1703]: Incoming connection from 128.131.80.105 port 59254 (handling in child 31071)
Aug 29 12:23:00 clara bbstored[31071]: Certificate CN: BACKUP-1 
Aug 29 12:23:00 clara bbstored[31071]: Login: Client ID 00000001, Read-only
Aug 29 12:39:51 clara bbstored[31071]: Connection statistics for BACKUP-1: IN=417 OUT=31964 TOTAL=32381 
Aug 29 12:39:51 clara bbstored[31071]: in server child, exception Connection Protocol_Timeout (Probably a network issue between client and server.) (7/41) -- terminating child

PID 28114 was still running with nearly 100% CPU when already at the
bbackupquery prompt. Typing "ls" just hang. I had to kill it, just
restarting the boxbackup-server didn't stop this task.

> Of course I agree, but an exception is an exception. It's not supposed to 
> happen, usually no recovery strategy is planned, and it's not even sure 
> that it's safe to continue execution or to continue to use the current 
> connection to the server. In this case it indicates that an internal bug 
> was found.

Yes, indeed. But I want to state that there are cases where an exception
is not an internal bug but another problem, e.g. that a (single) backup
store file was deleted or its permissions changed by somebody playing
around. Therefore there should be some fault tolerance or graceful error
recovery to not endanger the rest of the backup.

> I agree partly, but I think that we shouldn't have to write "backup 
> continues" after every error message. It should be safe to say that if you 
> see a message saying that it stopped because of an exception, then it did, 
> otherwise it didn't. Perhaps we should document that better.

Good idea.

> > In our backup store we have files with hex numbers in their names, e.g.
> > "./3c/05/o95.rfw". How can I find out to which file on the backup client
> > that belongs? Or (the other way round): how can I find out which backup
> > store files belong to a particular file on the client?
> 
> The path name is converted to ID by taking the two hex digits from each 
> component, reversing the order (most significant byte is the last one, 
> before ".rfw") and padding with zeroes on the left. So, for example, 
> ./cc/02/o1f.rfw is 001f02cc. (I think that's right anyway).

I found its a bit more complicated. For two levels above mentioned
file ./f2/o44.rfw belongs to the ID 0000f244. For three levels the
translation is ./xx/yy/ozz.rfw -> ID=00yyxxzz.

> You can compare those IDs to the ones given in the remote directory 
> listings in bbackupquery, but unfortunately there isn't a global reverse 
> mapping so you need to manually hunt through directories to find them, 
> sorry.

Hehe, thats a good point to mention a feature request. :-)

Another one: please add the feature to use "list" with filenames, e.g.
  list -dots *.txt
and with subdirectories
  list -dots db/
and combined
  list -dots */db/*.txt

> > Another question: In our backup store we have lots of files with
> > identical size but different md5sums, e.g.
> > 532869699 ./3d/05/o6a.rfw
> > 532869699 ./42/04/of3.rfw
> > 532869699 ./46/06/o24.rfw
> > 532869699 ./8d/04/o89.rfw
> > 532869699 ./94/03/o45.rfw
> > 532869699 ./96/o81.rfw
> > 532869699 ./a6/07/o00.rfw
> > 532869699 ./c2/06/o38.rfw
> > 532869699 ./ca/05/oa9.rfw
> > 532869699 ./d8/04/o19.rfw
> > There are several such runs. I assume that these all belong to the same
> > original file on the client. So, my question is, if a large file changes
> > slightly (e.g. gets a bit longer, or a few (kilo)bytes are modified),
> > will the backup store then hold only the difference (like SVN) or the
> > whole new plus the whole old versions of the file?
> 
> Normally it will only hold the difference. It would be interesting to know 
> if these are really multiple copies of old files, and if so why that 
> happened.

For some nearly-equally-sized files in the backup store I found that
they belong to the very same file on the client, so they represent
different versions. When looking with bbackupquery at old versions, all
of them have similar large size.

> In order to calculate the difference, bbackupd will perform a diff of the 
> file against a downloading stream. Since this could take a long time, 
> there is a built-in timeout (MaximumDiffingTime) and once that time is 
> exceeded, bbackupd will upload an entire new copy. If these are copies 
> of the same file, you might like to increase that timeout.

Unfortunately we backup on an external storage (Iomega StorCenter 150D)
connected with NFS over 100MBit/s which is _extremely_ slow, especially
for directory listings. So such timeouts might well be the problem. 

I found this option in bbackupd.conf. Which unit is used for the time?
Seconds? Milliseconds?

Another feature request: The backup server should (additionally) store
checksums across large blocks of files, e.g. 1MB blocks. Then only these
checksums need to be read from disk instead of the whole backup file.

Bye
  Hansi