Bash script clobbers something vital (lucid)

Tue Nov 22 03:42:01 UTC 2011

Synopsis: not a bug, but a hardware problem.

On Sun, Nov 20, 2011 at 9:32 AM, Kevin O'Gorman <kogorman at gmail.com> wrote:
> On Sat, Nov 19, 2011 at 11:41 PM, Karl Auer <kauer at biplane.com.au> wrote:
>> On Sat, 2011-11-19 at 22:47 -0800, Kevin O'Gorman wrote:
>>> I already know how to debug, actually
>>
>> Oh, good. Leave you to it, then.
>
> Thanks.  I could use some help with interpretation of results, however.
>
> I left a shorter version running last night around midnight and it
> exhibits the problem, albeit not quite in the way I expected.  It ran
> for 2 hours, and quietly died (no error messages) after 2 hours.  The
> backup drive is inaccessable, as reported earlier.
>
> There was an earlier suggestion that I should look in syslog.  There
> is nothing in syslog for the period the backup was running (except for
> the usual crontab reports).  Then at 7:08 AM, there are things
> relating to my checking results:
>
> Nov 20 06:47:01 localhost CRON[5514]: (root) CMD (test -x
> /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly ))
> Nov 20 06:47:50 localhost ntpd[2273]: kernel time sync status change 2001
> Nov 20 07:08:46 localhost kernel: [122104.602699] usb 2-1.1.3: USB
> disconnect, address 7
> Nov 20 07:08:46 localhost kernel: [122104.975064] usb 2-1.1.3: new
> high speed USB device using ehci_hcd and address 10
> Nov 20 07:08:46 localhost kernel: [122105.084686] usb 2-1.1.3:
> configuration #1 chosen from 1 choice
> Nov 20 07:08:46 localhost kernel: [122105.085894] scsi6 : SCSI
> emulation for USB Mass Storage devices
> Nov 20 07:08:46 localhost kernel: [122105.086063] usb-storage: device
> found at 10
> Nov 20 07:08:46 localhost kernel: [122105.086066] usb-storage: waiting
> for device to settle before scanning
> Nov 20 07:08:51 localhost kernel: [122110.081767] usb-storage: device
> scan complete
> Nov 20 07:08:51 localhost kernel: [122110.082578] scsi 6:0:0:0:
> Direct-Access     ST2000DL 003-9VT166            PQ: 0 ANSI: 2 CCS
> Nov 20 07:08:51 localhost kernel: [122110.083753] sd 6:0:0:0: Attached
> scsi generic sg3 type 0
> Nov 20 07:08:51 localhost kernel: [122110.091233] sd 6:0:0:0: [sdd]
> 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
> Nov 20 07:08:51 localhost kernel: [122110.092948] sd 6:0:0:0: [sdd]
> Write Protect is off
> Nov 20 07:08:51 localhost kernel: [122110.092955] sd 6:0:0:0: [sdd]
> Mode Sense: 00 38 00 00
> Nov 20 07:08:51 localhost kernel: [122110.092960] sd 6:0:0:0: [sdd]
> Assuming drive cache: write through
> Nov 20 07:08:51 localhost kernel: [122110.095139] sd 6:0:0:0: [sdd]
> Assuming drive cache: write through
> Nov 20 07:08:51 localhost kernel: [122110.095147]  sdd: sdd1
> Nov 20 07:08:51 localhost kernel: [122110.104675] sdd: p1 size
> 4294953054 exceeds device capacity, limited to end of disk
> Nov 20 07:08:51 localhost kernel: [122110.106989] sd 6:0:0:0: [sdd]
> Assuming drive cache: write through
> Nov 20 07:08:51 localhost kernel: [122110.106993] sd 6:0:0:0: [sdd]
> Attached SCSI disk
> Nov 20 07:09:01 localhost CRON[5547]: (root) CMD (  [ -x
> /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find
> /var/lib/php5/ -type f -cmin +$(/usr/lib/php5/max+++  lifetime)
> -print0 | xargs -n 200 -r -0 rm)
> Nov 20 07:16:09 localhost kernel: [122547.231676] EXT4-fs error
> (device sdc1): __ext4_get_inode_loc: unable to read inode block -
> inode=9308187, block=148897953
> Nov 20 07:16:09 localhost kernel: [122547.231702] EXT4-fs error
> (device sdc1) in ext4_reserve_inode_write: IO failure
> Nov 20 07:16:09 localhost kernel: [122547.231708] EXT4-fs (sdc1):
> previous I/O error to superblock detected
> Nov 20 07:16:09 localhost kernel: [122547.276335] EXT4-fs error
> (device sdc1): ext4_find_entry: reading directory #9308187 offset 0
> Nov 20 07:16:09 localhost kernel: [122547.276347] EXT4-fs (sdc1):
> previous I/O error to superblock detected
> Nov 20 07:16:14 localhost kernel: [122552.608123] end_request: I/O
> error, dev sdc, sector 1950781968
> Nov 20 07:16:14 localhost kernel: [122552.608135] Aborting journal on
> device sdc1-8.
> Nov 20 07:16:14 localhost kernel: [122552.608154] JBD2: I/O error
> detected when updating journal superblock for sdc1-8.
> Nov 20 07:17:01 localhost CRON[5563]: (root) CMD (   cd / && run-parts
> --report /etc/cron.hourly)
> Nov 20 07:30:01 localhost CRON[5686]: (root) CMD (start -q anacron || :)
>
> The output on the screen is also remarkable.  It ends thus:
>
>  *** Sat Nov 19 23:49:32 PST 2011 Back up root (no home)
>  *** Sun Nov 20 01:45:40 PST 2011 Back up subdirectory home (.tgz)
>  *** Sun Nov 20 01:58:11 PST 2011 Back up subdirectory usr/local (.tgz)
>  *** Sun Nov 20 01:58:13 PST 2011 Show usage after the backup
>  *** Sun Nov 20 02:44:47 PST 2011 "./ball.sh 2te" finished on dropkick
>
> real    175m22.010s
> user    118m6.667s
> sys     7m17.479s
>
> The 01:58:13 entry is from function bkdf.  The one after it is from
> bkfinish, almost an hour later.  Presumably, bksumit and bkscripts ran
> (silently, which is normal) in between.  This is what I would expect
> from a correct run.  However, as you can see from the above syslog,
> there was something seriously wrong with access to the backup drive
> when I came back to check on it this morning.  As I write this, I'm
> leaving the backup drive in this state so that I can figure out how to
> have my script detect this state.
>
> I'm presuming the drive will be accessable after a reboot.
>
> Anybody know how to interpret the syslog?

Sorry for the excitement.  The errors were not consistent, so trying
everything, I found that the problem disappeared when I changed USB
ports and cables.

-- 
Kevin O'Gorman, PhD