Bash script clobbers something vital (lucid)
Kevin O'Gorman
kogorman at gmail.com
Sun Nov 20 06:47:15 UTC 2011
On Sat, Nov 19, 2011 at 3:58 PM, Karl Auer <kauer at biplane.com.au> wrote:
> On Sat, 2011-11-19 at 14:21 -0800, Kevin O'Gorman wrote:
>> > Post the script.
>>
>> Attached. It's in three parts
>
> At first blush, I'd say you need to check the inputs more carefully -
> when you are playing around with fdisk and dd, it's essential that the
> parameters are correct. So in bkfuncts.sh, I'd be wrapping some serious
> error checking around those exported variables, especially drive and
> loc. It may not have anything to do with the current problem, but it
> will probably save you somewhere down the track.
Drive is checked against active mountpoints. If it's not mounted, the
script dies. I don't see how to improve on that.
Loc comes directly from hostname(1). It's used to make backup
filenames and to locate bkdropkick.sh in the current case, and other
files on other hosts.
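
For reference, the two checks described here can be sketched roughly as follows. This is a minimal illustration, not the code in bkfuncts.sh: the function names is_mounted and valid_loc, and the optional mounts-table argument, are assumptions made for the example.

```shell
# Sketch only: is_mounted and valid_loc are illustrative names,
# not functions taken from bkfuncts.sh.

# Succeed if $1 is an active mountpoint. $2 is the mounts table to read
# (defaults to /proc/mounts, where field 2 of each line is the mountpoint).
is_mounted() {
    awk -v d="$1" '$2 == d { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Succeed if $1 looks like a plain hostname: it starts with a letter or
# digit and contains only letters, digits, dots and hyphens.
valid_loc() {
    case $1 in
        ''|[!A-Za-z0-9]*|*[!A-Za-z0-9.-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Possible use near the top of a backup script:
#   loc=$(hostname)
#   valid_loc "$loc"    || { echo "suspicious loc: $loc" >&2; exit 1; }
#   is_mounted "$drive" || { echo "$drive is not mounted" >&2; exit 1; }
```

Note that /proc/mounts escapes spaces in mountpoints as \040, so a path containing spaces would need extra handling that this sketch does not attempt.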
>> everything on the local machine. The problem happens in the middle
>> of this script.
>
> Locating exactly where a bug happens is pretty much the first step to
> fixing it. If the symptom is that the drive is no longer readable, then
> set up a telltale file and check at likely points in your scripts that
> it still exists. If you suddenly can't find it or read it, the failure
> has happened between that point and the last point where you could see
> it. That narrows down the debug space.
>
> If you can reduce the magnitude of the backup while you debug, it will
> speed your debugging - can you set up a virtual with small disks and
> run all this stuff on the virtual? If it doesn't happen on the virtual,
> that's interesting information too.
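
The telltale-file suggestion quoted above could look something like this. It is a minimal sketch under stated assumptions: the file name .telltale and the functions plant_telltale and check_telltale are hypothetical, and the first argument is taken to be the top directory of the backup volume.

```shell
# Sketch only: the telltale name and both function names are hypothetical.
TELLTALE_NAME=.telltale

# Create the telltale file in directory $1 (the top of the backup volume).
plant_telltale() {
    echo "telltale planted $(date)" > "$1/$TELLTALE_NAME"
}

# Report whether the telltale in $1 can still be read.
# $2 is a free-form label naming the checkpoint in the script.
check_telltale() {
    if cat "$1/$TELLTALE_NAME" > /dev/null 2>&1; then
        echo "ok: telltale readable at: $2"
    else
        echo "FAILED: telltale unreadable at: $2" >&2
        return 1
    fi
}
```

Calling check_telltale after each major step brackets the failure between the last "ok" and the first "FAILED". One caveat: a read may be satisfied from the page cache, so a check like this might keep succeeding for a while after the on-disk damage has been done.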
I already know how to debug, actually, though my original posting may
not reflect that (3 am if I recall). My current hypothesis is that it
only fails on large workloads, partly because the eventual failure is
of a kind that would normally have been seen and fixed long ago. I'm
currently trying to build up the size from a shortened version that
does not fail. The "bksumit" adds a lot of time, but is my prime
suspect for where the bug is. (My current theory is that "chattr +i
*" causes the problem when there are dirty pages still in cache.)
Once I have a small-sized failure, I'll try the obvious fixes.
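
If the dirty-page theory holds, one candidate fix would be to flush the cache before setting the immutable flag. The following is only a guess at what that might look like, not something tested against the failure: lock_files and the CHATTR override are hypothetical names introduced for the example, while sync and chattr +i are the standard commands.

```shell
# Sketch only: lock_files and the CHATTR override are hypothetical.
# Setting CHATTR=echo gives a dry run that prints the arguments instead.
CHATTR=${CHATTR:-chattr}

lock_files() {
    # Ask the kernel to write dirty pages out to disk before the
    # files are made immutable.
    sync
    # chattr +i normally needs root and a filesystem that supports
    # the immutable attribute.
    "$CHATTR" +i "$@"
}
```

In the backup directory this would be called as lock_files * in place of a bare chattr +i *.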
>
>> If I comment out all of the commands
>> that worked, running the shell starts computing the md5sum's okay.
>
> If you know what commands worked, presumably you know which command
> didn't... do you know which command failed?
>
I think so, but it needs checking: md5sum failed, as did everything after it
that referenced the backup drive. I do know a number of commands that
did not fail, since they left files on the backup volume. Simply
retaining the last such does not cause the error, however, so I must
sneak up on this a little. The runs are long enough that I can only
try a few per day. How nice it is that I'm recently retired...
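
Since each run is long, it may help to have the script record every step as it goes, so that a single run shows which command was the first to fail. A minimal sketch, assuming a hypothetical wrapper named run_step and a log path held in STEP_LOG (both invented for this example):

```shell
# Sketch only: run_step and STEP_LOG are hypothetical names.
# The log should live on a different drive from the backup volume.
STEP_LOG=${STEP_LOG:-/tmp/bkdebug.log}

# Run a command, appending a timestamped START line before it and a
# line with its exit status after it, then return that status.
run_step() {
    echo "$(date '+%F %T') START: $*" >> "$STEP_LOG"
    "$@"
    status=$?
    echo "$(date '+%F %T') EXIT $status: $*" >> "$STEP_LOG"
    return "$status"
}
```

A step would then be written as, for example, run_step md5sum somefile, and a START line with no matching EXIT line marks a command that never returned.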
--
Kevin O'Gorman, PhD