[Bug 554172] Re: system services not starting at boot
Andy Whitcroft
apw at canonical.com
Wed Aug 11 18:39:48 UTC 2010
> open(2) does not document EIO as a valid return from this function, and
> I'm not even sure this error is appropriate - where it's used elsewhere it
> nearly always refers to a filesystem error - there are few exceptions. If
> the intent is that the calling process should just try again, shouldn't
> it instead return EAGAIN?
Though open(2) does indeed not document this error, it is a documented
POSIX return, and it has been possible for an open on a TTY to return it
for a very long time. Yes, EIO is not a very intuitive return, but a
different return code was chosen because it does indeed indicate something
different from what EAGAIN might: EAGAIN generally means "just do it
again", while EIO means "this is stuck closing at the moment".
> Also please bear in mind that "should" implies that it should somehow
> have been anticipated that the kernel was going to change an interface
> and introduce an undocumented non-transient error code where none existed
> before? :-)
This interface has _not_ changed. An open on a TTY which has recently been
closed has always had the possibility of returning EIO. The /dev/console
device is a TTY and therefore could trigger this behaviour; you have
been lucky up to now. Two things have changed. Firstly, the window in
which it can be triggered has widened slightly in the kernel. Secondly,
upstart recently stopped holding /dev/console open in the main thread (to
avoid the REISUB death); holding it open mitigates this issue completely.
(And we might consider this as a mitigation option.)
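As a rough illustration of the other userspace mitigation (retrying the
open rather than holding the device open), a minimal sketch follows. It is
not upstart's code. The function name, the attempt count, the delay and
the injectable `opener` parameter are all assumptions made for the
example; only EIO is treated as "still closing", and every other error is
passed straight through.

```python
import errno
import os
import time


def open_tty_retrying(path, flags, attempts=5, delay=0.05, opener=os.open):
    """Open path, retrying a bounded number of times if the open fails with EIO.

    Hypothetical helper: name, defaults and the opener hook are illustrative.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return opener(path, flags)
        except OSError as err:
            # Only EIO is treated as "the TTY is still closing".
            # Any other errno, or EIO on the final attempt, is re-raised.
            if err.errno != errno.EIO or attempt == attempts - 1:
                raise
            time.sleep(delay)
```

A caller would use it as `fd = open_tty_retrying("/dev/console", os.O_WRONLY | os.O_NOCTTY)`.
The retry is bounded so that a genuinely broken device still surfaces as
an error instead of looping forever.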
> Also, let's consider the other effects of this kernel change. For example,
> the following code from the initramfs that actually exec's init in the
> first place:
>
> exec run-init ${rootmnt} ${init} "$@" <${rootmnt}/dev/console >${rootmnt}/dev/console 2>&1
>
> This opens /dev/console to be bound to init's file descriptors, if the
> console has recently been closed, these shell redirects can now fail with
> EIO. That means it's not just init that has to be fixed, it's every single
> possible shell out there, including the shells inside things like busybox?
Actually the race can only be triggered by parallel execution, so for the
init process up to this point we are likely protected by being single threaded.
If the thread has recently closed the console it will have paid the cost of
closing it before continuing and we are not affected.
> This is why the kernel can't just push its own laziness down to
> userspace like this.
We commonly hand off unfortunate semantics to userspace and let that
handle things. EINTR is a classic example.
> Another point to consider (I discussed this with a few people here
> at LinuxCon):
>
> open() is supposed to be an inherently blocking system call, just like
> connect(), creat(), etc. If the kernel hasn't finished hanging up the
> tty from last time, it's *okay* for the subsequent open() to block for
> a while while it hangs up the tty and reinitializes it. The app will be
> expecting that.
>
> If the app calls open() with the O_NONBLOCK flag, which it accepts today
> already, then it's a non-blocking open - and in that case it would be
> acceptable for the kernel to fail the open with the EAGAIN or EWOULDBLOCK
> error - *NOT* EIO.
While that is a reasonable position to take, /dev/console is an implicitly
non-blocking device, so in your case, because you are using /dev/console,
you are getting non-blocking semantics whether you expected them or not.
Also, O_NONBLOCK does not actually say that the open should fail with
EWOULDBLOCK if it cannot be completed. It means: open the device without
waiting for it to be 'connected'; the result should be a successful open.
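The "succeed without waiting" meaning of O_NONBLOCK on open is easy to
see on a FIFO, which is used here purely as a stand-in for a TTY (that
substitution is an assumption of this sketch, as are the function names).
A read-only open of a FIFO with no writer would normally block; with
O_NONBLOCK it returns a valid descriptor straight away rather than failing
with EWOULDBLOCK.

```python
import os
import tempfile


def open_fifo_nonblocking(path):
    """Open a FIFO for reading without waiting for a writer to appear."""
    return os.open(path, os.O_RDONLY | os.O_NONBLOCK)


def demo():
    """Create a throwaway FIFO, open it non-blocking, and read from it."""
    with tempfile.TemporaryDirectory() as tmp:
        fifo = os.path.join(tmp, "demo.fifo")
        os.mkfifo(fifo)
        # No writer exists, yet the open succeeds immediately.
        fd = open_fifo_nonblocking(fifo)
        try:
            # With no writer attached, read reports end-of-file (b"").
            data = os.read(fd, 16)
        finally:
            os.close(fd)
        return fd >= 0, data
```

Calling `demo()` on a POSIX system shows the open succeeding even though
nothing is attached to the other end.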
> (not EIO because it turns out that that error code is already returned
> in some cases to indicate filesystem corruption or disk error, neither
> of which are transient and acceptable to loop on)
It seems that this is predicated only on your dislike of EIO as a return.
Yes it is an unexpected one, but we commonly use error codes to mean
different things from different types of device. EIO is defined as IO
Error, and generally means the IO you wanted to do was not possible. It is
not a big twist to use it to say "open failed because IO is not possible".
Nor for it to mean something completely different on a file and on a TTY.
If it returned an ESLOWCLOSEINPROGRESS or indeed EWOULDBLOCK it would
still be the same semantics, and I suspect you would still not be happy.
Overall I can understand these semantics are not ideal, but they are
the current semantics, and they are not new semantics either. Even with
the coming upstream changes (they appear to be merged now), the window
is not gone, just reduced, and EIO is still a possibility with some TTYs,
some of which can be consoles.
I have been doing some research to see if I can find a basis for this
selection of return code and indeed this behaviour; so far I have not
found one. But even if upstream were to concur this is not the correct
behaviour and change it, we are unlikely to have anything concrete in
short order. Upstart is cross distribution, so it is likely to need some
mitigation against this behaviour even if we get upstream fixed and
could backport this to Ubuntu's kernels. I would note here that the
patches in question are extensive and not likely to be SRUable.
--
system services not starting at boot
https://bugs.launchpad.net/bugs/554172
You received this bug notification because you are a member of Kernel
Bugs, which is subscribed to linux in ubuntu.