[Oneiric-Topic] Server Boot

Scott Kitterman ubuntu at kitterman.com
Wed Mar 30 15:26:43 UTC 2011


On Wednesday, March 30, 2011 11:21:04 AM Alvin wrote:
> On Wednesday 30 March 2011 14:52:14 Serge E. Hallyn wrote:
> > Quoting Scott Kitterman (ubuntu at kitterman.com):
> > > There was a lot of discussion around improving the server boot
> > > experience before the UDS-M.  A number of people expressed interest in
> > > seeing more useful diagnostic information during boot.  Others
> > > expressed concerns with boot reliability on the more complex hardware
> > > typically found in servers.
> > > 
> > > How are we doing on this?  Personally, I can't remember the last time I
> > > rebooted a server and it wasn't via SSH and the hardware I use is the
> > > sort there were problems with.  Are these still issues for the Ubuntu
> > > Server community?
> > > 
> > > Scott K
> > 
> > I think right now these issues are overshadowed by the fact that a
> > great deal of server software is not yet upstartified.  I think that
> > needs to be addressed for O.
> 
> Yes, they are certainly still issues (and the primary reason the company I
> work for is abandoning Ubuntu.)
> 
> I agree that a lot of servers are not often rebooted, but not every server
> is a webserver. Some are used only during certain hours and can be booted
> automatically (BIOS or WOL) when needed in order to keep the electricity
> bill down. Booting should be a reliable and automated process. Accurate
> logging is important in order to know what went wrong in case the
> unthinkable happens.
> 
> The current boot.log looks like:
> > mount.nfs: DNS resolution failed for 192.168.xxx.3: Name or service not known
> > mount.nfs4: Failed to resolve server exampleserver: Name or service not known
> > mountall: mount /srv/example [1134] terminated with status 32
> > mount error(101): Network is unreachable
> 
> while in reality the filesystems are mounted. Now, when something does go
> wrong, the log is identical. Conclusion: boot.log is useless. (Actually, the
> log is probably correct: it can't resolve server names at that specific
> time.) Proper boot logging would be popular[1].
> 
> Take the following example of a server boot. Let's also assume that nothing
> goes wrong that could lead to a busybox console. (It certainly can![2][3])
> So, you're now sitting in front of a nice prompt. Everything looks ok, but
> is it? The server mounts NFS shares from another server, it runs
> KVM/libvirt with a netfs storage pool for its virtual machines and a
> quasselcore for IRC that stores its data in PostgreSQL on another
> server. The local filesystem uses mdadm for RAID1 and LVM on top of that.
> Very server-like. (I once made this setup to test some things.) In order
> to keep things under control, there are /no/ LVM snapshots. That is
> another ugly story.
> 
> So, what happens now:
> - The RAID will be broken! [4][5]
> - The NFS shares in /etc/fstab might not be mounted, [6][7]
>   even when you told the system to wait with _netdev. [8]
> - Your virtual machines on netfs will not be running. [9]
> - The quasselcore with external db will not be started. [10]
> 
> The array can be assembled by running a command and all of the above
> daemons can be started manually.
> 
> I talked about some of those topics on IRC, and the following workarounds
> came up. There are also some workarounds in the bug reports.
> - Put NFS shares in /etc/fstab, and don't configure them as netfs storage
> pools.
> - Put the IP addresses of your NFS servers in /etc/hosts.
> 
> For most servers, speeding up the boot process is less important than
> reliability. Why not take a look at how Debian does it? You can disable
> running the boot scripts in parallel with 'CONCURRENCY=none' in
> /etc/default/rcS.
> 
> Also, think about daemons of commercial software without upstart scripts.
> You never know whether they will start at boot or not.
> 
> Links:
> [1] "init: support logging of job output"
>     https://bugs.launchpad.net/bugs/328881
> 
> [2] "Gave up waiting for root device after upgrade then busybox console"
>     https://bugs.launchpad.net/bugs/360378
> 
> [3] "karmic rc: root device sometimes not found"
>     https://bugs.launchpad.net/bugs/460914
> 
> [4] "mdadm cannot assemble array as cannot open drive with O_EXCL"
>     https://bugs.launchpad.net/bugs/27037
> 
> [5] "mdadm cannot assemble array"
>     https://bugs.launchpad.net/bugs/599135
> 
> [6] "nfs mounts specified in fstab is not mounted on boot."
>     https://bugs.launchpad.net/bugs/275451
> 
> [7] "nfs shares are not automounted anymore in intrepid"
>     https://bugs.launchpad.net/bugs/285013
> 
> [8] "_netdev not working"
>     https://bugs.launchpad.net/bugs/384347
> 
> [9] "Libvirt NFS mount on boot."
>     https://bugs.launchpad.net/bugs/351307
> 
> [10] "quasselcore does not connect to database at boot"
>      https://bugs.launchpad.net/bugs/612729

This is exactly the kind of detailed feedback I was hoping to get.  Thank you.  
I suspect we'll need to have several UDS sessions around server boot in order 
to lay out a comprehensive plan of attack.  The release before the next LTS is 
definitely the cycle to hit this.

Scott K
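
For anyone who wants to answer the "everything looks ok, but is it?" question
from the quoted message after a reboot, here is a minimal sketch of a
post-boot check. It is not part of any Ubuntu tooling. The function names
(expected_nfs_mounts, missing_nfs_mounts, degraded_arrays, check_system) are
made up for illustration, and it assumes the usual Linux layouts of
/etc/fstab, /proc/mounts and /proc/mdstat. It only covers two of the failure
modes listed above: NFS entries in fstab that did not get mounted, and md
arrays running with a missing member.

```python
import os
import re

NETWORK_FS_TYPES = {"nfs", "nfs4"}


def expected_nfs_mounts(fstab_text):
    """Return mount points of NFS entries in fstab-formatted text.

    Entries marked 'noauto' are skipped, since they are not expected
    to be mounted at boot.
    """
    mount_points = []
    for line in fstab_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        fields = line.split()
        if len(fields) < 3:
            continue
        options = fields[3].split(",") if len(fields) > 3 else []
        if fields[2] in NETWORK_FS_TYPES and "noauto" not in options:
            mount_points.append(fields[1])
    return mount_points


def missing_nfs_mounts(fstab_text, mounts_text):
    """Return fstab NFS mount points that do not appear in /proc/mounts text.

    Assumption: mount points contain no spaces (/proc/mounts escapes them).
    """
    mounted = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            mounted.add(fields[1])
    return [mp for mp in expected_nfs_mounts(fstab_text) if mp not in mounted]


def degraded_arrays(mdstat_text):
    """Return md device names whose member status contains '_', e.g. [U_]."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
    return degraded


def _read(path):
    """Read a file, returning an empty string if it does not exist."""
    if not os.path.exists(path):
        return ""
    with open(path) as handle:
        return handle.read()


def check_system(fstab="/etc/fstab", mounts="/proc/mounts", mdstat="/proc/mdstat"):
    """Return a list of human-readable problems found on this host."""
    problems = []
    for mp in missing_nfs_mounts(_read(fstab), _read(mounts)):
        problems.append("NFS share not mounted: %s" % mp)
    for md in degraded_arrays(_read(mdstat)):
        problems.append("RAID array degraded: %s" % md)
    return problems
```

Calling check_system() from a late boot job or from cron and mailing any
non-empty result would give at least some signal that boot.log currently does
not. The parsing is kept separate from the file reading so the logic can be
tried against sample text without touching a real server.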



