Bash script clobbers something vital (lucid)

Sun Nov 20 17:32:58 UTC 2011

On Sat, Nov 19, 2011 at 11:41 PM, Karl Auer <kauer at biplane.com.au> wrote:
> On Sat, 2011-11-19 at 22:47 -0800, Kevin O'Gorman wrote:
>> I already know how to debug, actually
>
> Oh, good. Leave you to it, then.

Thanks.  I could use some help with interpretation of results, however.

I left a shorter version running last night around midnight and it
exhibits the problem, albeit not quite in the way I expected.  It ran
for 2 hours, and quietly died (no error messages) after 2 hours.  The
backup drive is inaccessable, as reported earlier.

There was an earlier suggestion that I should look in syslog.  There
is nothing in syslog for the period the backup was running (except for
the usual crontab reports).  Then at 7:08 AM, there are things
relating to my checking results:

Nov 20 06:47:01 localhost CRON[5514]: (root) CMD (test -x
/usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly ))
Nov 20 06:47:50 localhost ntpd[2273]: kernel time sync status change 2001
Nov 20 07:08:46 localhost kernel: [122104.602699] usb 2-1.1.3: USB
disconnect, address 7
Nov 20 07:08:46 localhost kernel: [122104.975064] usb 2-1.1.3: new
high speed USB device using ehci_hcd and address 10
Nov 20 07:08:46 localhost kernel: [122105.084686] usb 2-1.1.3:
configuration #1 chosen from 1 choice
Nov 20 07:08:46 localhost kernel: [122105.085894] scsi6 : SCSI
emulation for USB Mass Storage devices
Nov 20 07:08:46 localhost kernel: [122105.086063] usb-storage: device
found at 10
Nov 20 07:08:46 localhost kernel: [122105.086066] usb-storage: waiting
for device to settle before scanning
Nov 20 07:08:51 localhost kernel: [122110.081767] usb-storage: device
scan complete
Nov 20 07:08:51 localhost kernel: [122110.082578] scsi 6:0:0:0:
Direct-Access     ST2000DL 003-9VT166            PQ: 0 ANSI: 2 CCS
Nov 20 07:08:51 localhost kernel: [122110.083753] sd 6:0:0:0: Attached
scsi generic sg3 type 0
Nov 20 07:08:51 localhost kernel: [122110.091233] sd 6:0:0:0: [sdd]
3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Nov 20 07:08:51 localhost kernel: [122110.092948] sd 6:0:0:0: [sdd]
Write Protect is off
Nov 20 07:08:51 localhost kernel: [122110.092955] sd 6:0:0:0: [sdd]
Mode Sense: 00 38 00 00
Nov 20 07:08:51 localhost kernel: [122110.092960] sd 6:0:0:0: [sdd]
Assuming drive cache: write through
Nov 20 07:08:51 localhost kernel: [122110.095139] sd 6:0:0:0: [sdd]
Assuming drive cache: write through
Nov 20 07:08:51 localhost kernel: [122110.095147]  sdd: sdd1
Nov 20 07:08:51 localhost kernel: [122110.104675] sdd: p1 size
4294953054 exceeds device capacity, limited to end of disk
Nov 20 07:08:51 localhost kernel: [122110.106989] sd 6:0:0:0: [sdd]
Assuming drive cache: write through
Nov 20 07:08:51 localhost kernel: [122110.106993] sd 6:0:0:0: [sdd]
Attached SCSI disk
Nov 20 07:09:01 localhost CRON[5547]: (root) CMD (  [ -x
/usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find
/var/lib/php5/ -type f -cmin +$(/usr/lib/php5/max+++  lifetime)
-print0 | xargs -n 200 -r -0 rm)
Nov 20 07:16:09 localhost kernel: [122547.231676] EXT4-fs error
(device sdc1): __ext4_get_inode_loc: unable to read inode block -
inode=9308187, block=148897953
Nov 20 07:16:09 localhost kernel: [122547.231702] EXT4-fs error
(device sdc1) in ext4_reserve_inode_write: IO failure
Nov 20 07:16:09 localhost kernel: [122547.231708] EXT4-fs (sdc1):
previous I/O error to superblock detected
Nov 20 07:16:09 localhost kernel: [122547.276335] EXT4-fs error
(device sdc1): ext4_find_entry: reading directory #9308187 offset 0
Nov 20 07:16:09 localhost kernel: [122547.276347] EXT4-fs (sdc1):
previous I/O error to superblock detected
Nov 20 07:16:14 localhost kernel: [122552.608123] end_request: I/O
error, dev sdc, sector 1950781968
Nov 20 07:16:14 localhost kernel: [122552.608135] Aborting journal on
device sdc1-8.
Nov 20 07:16:14 localhost kernel: [122552.608154] JBD2: I/O error
detected when updating journal superblock for sdc1-8.
Nov 20 07:17:01 localhost CRON[5563]: (root) CMD (   cd / && run-parts
--report /etc/cron.hourly)
Nov 20 07:30:01 localhost CRON[5686]: (root) CMD (start -q anacron || :)

The output on the screen is also remarkable.  It ends thus:

 *** Sat Nov 19 23:49:32 PST 2011 Back up root (no home)
 *** Sun Nov 20 01:45:40 PST 2011 Back up subdirectory home (.tgz)
 *** Sun Nov 20 01:58:11 PST 2011 Back up subdirectory usr/local (.tgz)
 *** Sun Nov 20 01:58:13 PST 2011 Show usage after the backup
 *** Sun Nov 20 02:44:47 PST 2011 "./ball.sh 2te" finished on dropkick

real	175m22.010s
user	118m6.667s
sys	7m17.479s

The 01:58:13 entry is from function bkdf.  The one after it is from
bkfinish, almost an hour later.  Presumably, bksumit and bkscripts ran
(silently, which is normal) in between.  This is what I would expect
from a correct run.  However, as you can see from the above syslog,
there was something seriously wrong with access to the backup drive
when I came back to check on it this morning.  As I write this, I'm
leaving the backup drive in this state so that I can figure out how to
have my script detect this state.

I'm presuming the drive will be accessable after a reboot.

Anybody know how to interpret the syslog?

-- 
Kevin O'Gorman, PhD