[Box Backup] BadBackupStoreFile

Chris Wilson boxbackup@fluffy.co.uk
Sun, 9 Sep 2007 16:39:19 +0100 (BST)


Hi Johann,

Sorry for the delay in replying, it's been quite a mad week for me and 
this is a tricky problem you've got here.

On Mon, 3 Sep 2007, Johann Glaser wrote:

>>> 2) slow backup storage mounted via NFS (Iomega StorCenter 150d 2TB)
>>>    The backup usually starts at 3:00 and finishes between 10:00 and
>>>    12:00, so it runs 7-9 hours for approx. 45GB.
>>
>> Do you know if it's IO bound or CPU bound on the client or server or 
>> whether it's the tape device that's slowing it down? How big are the 
>> usual incremental backups in volume of changed files? (you can get that 
>> from the statistics in syslog at the end of each backup run)
>
> The Iomega StorCenter is a NAS, i.e. one large hard disk (to be precise: 
> RAID5) offered via NFS. The limiting factors are both, CPU and IO. When 
> looking at the processes with "top" bbackupd sometimes utilizes nearly 
> 100% CPU, and sometimes only very few. bbstored also sometimes utilizes 
> approx. 20% CPU.

Looks like it's bbackupd's fault then. Certainly the symptoms you're 
seeing of repeatedly reading the same file might well be related to that, 
and the high CPU usage does imply that it's calculating checksums or doing 
some other CPU-intensive work.

Since the client appears to be the limiting factor, what is its CPU and 
memory size?

>
> Syslog says (for the last few days, I've added the thousand-dots, for
> Aug 31 there is no "Connection statistics" line):
> Aug 26 11:10:19 clara bbstored[20310]: Connection statistics for BACKUP-1: IN= 590.499.340 OUT=57301972 TOTAL= 647.801.312
> Aug 26 11:10:19 clara bbackupd[ 1693]: File statistics: total file size uploaded  9.766.762.140, bytes already on server  8.763.266.143, encoded size   588.573.414
> Aug 27 13:06:39 clara bbstored[18602]: Connection statistics for BACKUP-1: IN=3.308.622.563 OUT=66123209 TOTAL=3.374.745.772
> Aug 27 13:06:39 clara bbackupd[ 1693]: File statistics: total file size uploaded 14.685.054.411, bytes already on server 10.514.171.221, encoded size 3.306.372.675
> Aug 28 13:35:49 clara bbstored[16211]: Connection statistics for BACKUP-1: IN=3.143.079.310 OUT=67817968 TOTAL=3.210.897.278
> Aug 28 13:35:49 clara bbackupd[ 1693]: File statistics: total file size uploaded 15.233.073.620, bytes already on server 11.229.233.942, encoded size 3.140.764.015
> Aug 29 13:39:58 clara bbstored[14071]: Connection statistics for BACKUP-1: IN=2.674.544.582 OUT=66673452 TOTAL=2.741.218.034
> Aug 29 13:39:58 clara bbackupd[ 1693]: File statistics: total file size uploaded 13.968.680.100, bytes already on server 10.875.442.949, encoded size 2.672.245.491
> Aug 30 14:08:59 clara bbstored[14387]: Connection statistics for BACKUP-1: IN=2.318.091.276 OUT=73106275 TOTAL=2.391.197.551
> Aug 30 14:08:59 clara bbackupd[ 1693]: File statistics: total file size uploaded 14.853.920.326, bytes already on server 11.349.955.798, encoded size 2.315.752.024
> Aug 31 10:15:01 clara bbackupd[ 1693]: File statistics: total file size uploaded 12.023.532.073, bytes already on server  7.305.642.825, encoded size 3.782.262.988
> Sep  1 15:30:43 clara bbstored[10298]: Connection statistics for BACKUP-1: IN=4.799.606.979 OUT=70868153 TOTAL=4.870.475.132
> Sep  1 15:30:43 clara bbackupd[ 1693]: File statistics: total file size uploaded 17.103.473.401, bytes already on server 10.853.748.045, encoded size 4.797.115.905
> Sep  2 12:45:06 clara bbstored[ 4170]: Connection statistics for BACKUP-1: IN=2.204.990.690 OUT=51000940 TOTAL=2.255.991.630
> Sep  2 12:45:06 clara bbackupd[ 1693]: File statistics: total file size uploaded 11.585.981.608, bytes already on server  7.978.337.448, encoded size 2.203.395.290
>
> Interestingly, it finds ~>10GB every day to be "new" but there should
> only be <<1GB.

I think that "total file size uploaded" should actually read "total file 
size covered by backup". The actual amount uploaded, before compression, 
is the difference between this and "bytes already on server". This seems 
to be 2-7 GB for you each day.

> I suspect the low diff time, which I've now increased to 20.000 seconds. 
> :-)

That may be a bit too long! I wonder how long the diffs take in practice? 
If we're constantly re-reading files then we're certainly wasting a lot of 
time in diffing.

> The StorCenter is attached via a 100MBit LAN. We tested with a 1GB LAN, 
> but the write-speed for large files didn't go above 13MByte/sec. Its 
> main problem is accessing many small files where it seems to have 
> additional delay for every single file.

That would definitely hit Box Backup badly. It's designed for use on hard 
disks and conventional filesystem, where many small files are not a 
problem at all, and Box makes full use of this.

If you're backing up to a device that doesn't meet those expectations, 
then I'm afraid you're going to see awful performance with Box. I really 
don't know if it's worth changing that for this use case alone, given that 
a change to the store format will invalidate all existing stored data for 
all our users.

Have you tried raising the issue with Iomega? If the device has a hard 
disk inside it, this sounds like unreasonable behaviour, i.e. probably a 
bug.

> BTW: The Sesam backup needs 7 minutes for "daily" backups (differential)
> but also has to check a similarely large amount of data.

Because Sesam is designed to work with tapes, it probably doesn't touch 
the tape unless it has to, i.e. it stores all the information locally (on 
disk) to decide what it needs to back up.

It's difficult to tell from their website whether it does incremental 
diffs of changes within files, or whole files, and how much metadata it 
stores. But if it keeps all metadata on a local disk then it could 
certainly be faster than Box.

We might wish to look at that as a possible performance optimisation in 
future (including maybe keeping a copy of the file block checksums 
locally). But for the use case that Box was designed for, i.e. backing up 
over a slow internet connection (rather than a fast LAN) this should not 
be the limiting factor anyway.

Cheers, Chris.
-- 
_____ __     _
\  __/ / ,__(_)_  | Chris Wilson <0000 at qwirx.com> - Cambs UK |
/ (_/ ,\/ _/ /_ \ | Security/C/C++/Java/Perl/SQL/HTML Developer |
\ _/_/_/_//_/___/ | We are GNU-free your mind-and your software |