[Bug 1636708] Re: ifup -a does not start dependants last, causes deadlocks with vlans/bonding
Ryan Harper
1636708 at bugs.launchpad.net
Tue May 9 16:38:12 UTC 2017
On Tue, May 9, 2017 at 10:32 AM, Dimitri John Ledkov <launchpad at surgut.co.uk> wrote:
> Maybe this is easier to read.
>
> root@xnox-iad-nr5:~# journalctl -o short-monotonic -u ifup@bond0.service
> -- Logs begin at Tue 2017-05-09 10:57:18 UTC, end at Tue 2017-05-09 15:22:34 UTC. --
> [ 6.740201] xnox-iad-nr5 systemd[1]: Started ifup for bond0.
> [ 6.750333] xnox-iad-nr5 sh[1184]: Waiting for a slave to join bond0 (will timeout after 60s)
> [ 6.987241] xnox-iad-nr5 sh[1184]: ifquery: recursion detected for interface bond0 in pre-up phase
> [ 6.987341] xnox-iad-nr5 sh[1184]: ifquery: recursion detected for parent interface bond0 in pre-up phase
> [ 6.987425] xnox-iad-nr5 sh[1184]: ifquery: recursion detected for parent interface bond0 in pre-up phase
> root@xnox-iad-nr5:~# journalctl -o short-monotonic -u ifup@bond0.101.service
> -- Logs begin at Tue 2017-05-09 10:57:18 UTC, end at Tue 2017-05-09 15:22:34 UTC. --
> [ 6.755723] xnox-iad-nr5 systemd[1]: Started ifup for bond0.101.
> [ 6.757056] xnox-iad-nr5 sh[1286]: ifup: waiting for lock on /run/network/ifstate.bond0
> [ 7.227572] xnox-iad-nr5 sh[1286]: Set name-type for VLAN subsystem. Should be visible in /proc/net/vlan/config
> root@xnox-iad-nr5:~# journalctl -o short-monotonic -u ifup@bond0.401.service
> -- Logs begin at Tue 2017-05-09 10:57:18 UTC, end at Tue 2017-05-09 15:22:34 UTC. --
> [ 6.760568] xnox-iad-nr5 systemd[1]: Started ifup for bond0.401.
> [ 6.761920] xnox-iad-nr5 sh[1290]: ifup: waiting for lock on /run/network/ifstate.bond0
> [ 7.197983] xnox-iad-nr5 sh[1290]: Set name-type for VLAN subsystem. Should be visible in /proc/net/vlan/config
> root@xnox-iad-nr5:~# journalctl -o short-monotonic -u networking.service
> -- Logs begin at Tue 2017-05-09 10:57:18 UTC, end at Tue 2017-05-09 15:22:34 UTC. --
> [ 6.645323] xnox-iad-nr5 systemd[1]: Starting Raise network interfaces...
> [ 6.692530] xnox-iad-nr5 ifup[992]: Waiting for bonding kernel module to be ready (will timeout after 5s)
> [ 6.693104] xnox-iad-nr5 ifup[992]: Waiting for bond master bond0 to be ready
> [ 7.221867] xnox-iad-nr5 ifup[992]: /sbin/ifup: waiting for lock on /run/network/ifstate.bond0
> [ 7.263179] xnox-iad-nr5 systemd[1]: Started Raise network interfaces.
>
>
What is the status/timestamp of the physical devices (enspf90 and 91)?
Note that networking, ifup@bond0.101, and ifup@bond0.401 all hit locks
that are held by bond0.
> Note that ifup@bond0.service beats networking.service to acquire the lock.
>
Hrm, I didn't think we got ifup events for non-physical devices; this is
certainly problematic.
The design, IIUC, is for physical devices to get ifup@.service triggers via
udev events; see the BindsTo=sys-subsystem-net-devices-%i.device line.
The result in this boot is that networking.service (which starts at 6.64
seconds) calls ifup -a (pid 992), which races with ifup@bond0 (started at
6.74, pid 1184); so we have *two* ifups locking bond0. I think this is the
real deadlock.
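The race is easier to see when the per-unit journals are merged into a single timeline. The following is a minimal sketch, not part of the original report: it takes a few of the journal lines quoted above (wrapped lines rejoined, and the unit names written with "@"), sorts them by the monotonic timestamp, and lists which processes ended up waiting on the lock. The names JOURNAL, merge_timeline and lock_waiters are made up for illustration.

```python
import re

# A subset of the journal lines quoted above, keyed by unit name.
JOURNAL = {
    "ifup@bond0.service": [
        "[ 6.740201] xnox-iad-nr5 systemd[1]: Started ifup for bond0.",
        "[ 6.750333] xnox-iad-nr5 sh[1184]: Waiting for a slave to join bond0 (will timeout after 60s)",
    ],
    "ifup@bond0.101.service": [
        "[ 6.755723] xnox-iad-nr5 systemd[1]: Started ifup for bond0.101.",
        "[ 6.757056] xnox-iad-nr5 sh[1286]: ifup: waiting for lock on /run/network/ifstate.bond0",
    ],
    "networking.service": [
        "[ 6.645323] xnox-iad-nr5 systemd[1]: Starting Raise network interfaces...",
        "[ 6.693104] xnox-iad-nr5 ifup[992]: Waiting for bond master bond0 to be ready",
        "[ 7.221867] xnox-iad-nr5 ifup[992]: /sbin/ifup: waiting for lock on /run/network/ifstate.bond0",
    ],
}

# "[ <seconds>] <host> <program>[<pid>]: <message>" as printed by
# journalctl -o short-monotonic.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+\S+\s+(\S+?)\[(\d+)\]:\s+(.*)$")


def merge_timeline(journal):
    """Return (timestamp, unit, program, pid, message) tuples sorted by time."""
    events = []
    for unit, lines in journal.items():
        for line in lines:
            match = LINE_RE.match(line)
            if match is None:
                continue  # skip anything that is not a journal entry
            ts, prog, pid, msg = match.groups()
            events.append((float(ts), unit, prog, int(pid), msg))
    return sorted(events)


def lock_waiters(events):
    """Return (timestamp, unit, pid) for every 'waiting for lock' message."""
    return [
        (ts, unit, pid)
        for ts, unit, _prog, pid, msg in events
        if "waiting for lock on" in msg
    ]


if __name__ == "__main__":
    for ts, unit, prog, pid, msg in merge_timeline(JOURNAL):
        print(f"{ts:9.6f}  {unit:<24} {prog}[{pid}]: {msg}")
```

Read this way, pid 992 (ifup -a from networking.service) starts first but only reports waiting on the lock after ifup@bond0 (pid 1184) is already running.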
I'm not sure why we're getting an ifup@bond0 unit execution; the udev
scripts are only triggered for physical devices.
Looks like when the module is loaded (modprobe bonding) we get an
event for bond0, which may be triggering the hotplug udev rule that
queues an ifup@bond0 job.
This is strange (to me):
# modprobe -vr bonding
rmmod bonding
root@x1:~# ifconfig -a
ens3      Link encap:Ethernet  HWaddr 52:54:00:87:54:74
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:2 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:104 (104.0 B)  TX bytes:184 (184.0 B)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:320 errors:0 dropped:0 overruns:0 frame:0
          TX packets:320 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:23680 (23.6 KB)  TX bytes:23680 (23.6 KB)
root@x1:~# cat /etc/issue
Ubuntu 16.04.2 LTS \n \l
root@x1:~# lsb_release -rd
Description:    Ubuntu 16.04.2 LTS
Release:        16.04
root@x1:~# ls -al /sys/class/net/
total 0
drwxr-xr-x  2 root root 0 May  9 16:36 .
drwxr-xr-x 63 root root 0 May  9 16:35 ..
lrwxrwxrwx  1 root root 0 May  9 16:35 ens3 -> ../../devices/pci0000:00/0000:00:03.0/virtio0/net/ens3
lrwxrwxrwx  1 root root 0 May  9 16:35 lo -> ../../devices/virtual/net/lo
root@x1:~# modprobe -v bonding
insmod /lib/modules/4.4.0-75-generic/kernel/drivers/net/bonding/bonding.ko
root@x1:~# ls -al /sys/class/net/
total 0
drwxr-xr-x  2 root root    0 May  9 16:36 .
drwxr-xr-x 63 root root    0 May  9 16:36 ..
lrwxrwxrwx  1 root root    0 May  9 16:36 bond0 -> ../../devices/virtual/net/bond0
-rw-r--r--  1 root root 4096 May  9 16:36 bonding_masters
lrwxrwxrwx  1 root root    0 May  9 16:36 ens3 -> ../../devices/pci0000:00/0000:00:03.0/virtio0/net/ens3
lrwxrwxrwx  1 root root    0 May  9 16:36 lo -> ../../devices/virtual/net/lo
root@x1:~# uname -r
4.4.0-75-generic
Why would bond0 get created when loading the bonding module?
@xnox, if you change your interfaces from bond0 to bond1 globally, does
your configuration work?
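The /sys/class/net listings above already distinguish the two kinds of device: bond0 and lo point into devices/virtual/net, while ens3 points at a bus device. A rough sketch of that check, assuming the symlink target is a good enough heuristic (the function names are invented here, and this is not how the udev rules decide anything):

```python
import os


def is_virtual(link_target):
    """Heuristic: virtual netdevs resolve under devices/virtual/net/."""
    return "devices/virtual/net/" in link_target


def classify(links):
    """Map interface name -> 'virtual' or 'physical' from symlink targets."""
    return {
        name: "virtual" if is_virtual(target) else "physical"
        for name, target in links.items()
    }


def read_sys_class_net(root="/sys/class/net"):
    """Collect the symlinks under /sys/class/net (plain files are skipped)."""
    links = {}
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if os.path.islink(path):
            links[name] = os.readlink(path)
    return links


# Symlink targets copied from the listing taken after "modprobe -v bonding".
AFTER_MODPROBE = {
    "bond0": "../../devices/virtual/net/bond0",
    "ens3": "../../devices/pci0000:00/0000:00:03.0/virtio0/net/ens3",
    "lo": "../../devices/virtual/net/lo",
}

if __name__ == "__main__":
    for name, kind in classify(AFTER_MODPROBE).items():
        print(f"{name}: {kind}")
```

On a live system, classify(read_sys_class_net()) gives the same view; bonding_masters is a regular file and is ignored.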
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1636708
>
> Title:
> ifup -a does not start dependants last, causes deadlocks with
> vlans/bonding
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/ifupdown/+bug/
> 1636708/+subscriptions
>
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to ifupdown in Ubuntu.
https://bugs.launchpad.net/bugs/1636708
Title:
ifup -a does not start dependants last, causes deadlocks with
vlans/bonding
Status in ifupdown package in Ubuntu:
Confirmed
Status in ifupdown source package in Xenial:
New
Bug description:
This is a problem I've been struggling with since moving to 16.04.1
from 14.04 (fresh install)
I don't believe this problem affected 14.04. I have used an almost
identical interfaces file on 14.04 without problem.
On 16.04.1, however, 9/10 boots would hang during network
configuration and leave the network incorrectly configured.
When calling "ifup -a", all candidate interfaces appear to be started
in parallel, leading to collisions with locks. This causes hanging
(until timeout) during booting and leaves the network interfaces
incorrectly configured.
Imagine this /etc/network/interfaces
auto eno1 bond0 bond0.5

iface eno1 inet manual
    bond-master bond0

iface bond0 inet manual
    bond-slaves eno1
    bond-mode 4
    bond-lacp-rate 1
    bond-miimon 100
    bond-updelay 200
    bond-downdelay 200

iface bond0.5 inet dhcp
    vlan-raw-device bond0
eno1 -> bond0 -> bond0.5 -> dhcp
When calling "ifup -a" at boot time all three interfaces are started
at the same time.
bond0 and bond0.5 both attempt to share the same lock file:
/run/network/ifstate.bond0
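The sharing can be written down as a one-line rule. This is a sketch of what the report and the quoted logs suggest (bond0.101 and bond0.401 both wait on ifstate.bond0), not a reading of the ifupdown source; lock_path is a made-up helper and the "strip the VLAN suffix" rule is an assumption.

```python
def lock_path(iface, run_dir="/run/network"):
    """Assumed rule: a VLAN interface locks the state file of its parent."""
    parent = iface.split(".", 1)[0]  # "bond0.5" -> "bond0", "eno1" -> "eno1"
    return f"{run_dir}/ifstate.{parent}"


def shares_lock(a, b):
    """True when two interfaces would contend for the same lock file."""
    return lock_path(a) == lock_path(b)


if __name__ == "__main__":
    for iface in ("eno1", "bond0", "bond0.5"):
        print(iface, "->", lock_path(iface))
```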
If bond0 wins the race, the system will start correctly (1/10):
* bond0 starts and creates the bond0 device and the ifenslave.bond0 file to indicate the bond is ready
* eno1 polls for the ifenslave.bond0 file, when it appears it attaches eno1 to bond0
* bond0 finishes and releases the lock
* bond0.5 now acquires the lock.
* bond0.5 starts dhclient, which can talk to the network and configure the interface
If, however, bond0.5 wins the lock race, the system will hang at boot (5 mins) and fail to set up the network.
* bond0.5 is awarded the ifstate.bond0 lockfile
* bond0.5 starts dhclient waiting to hear from the network
* bond0 is blocked, so neither the bond0 device nor the ifenslave.bond0 file is created
* eno1 polls but never finds the ifenslave.bond0 file so never attaches to bond0
* bond0.5's dhclient is trying to talk to a disconnected network and never receives an answer
! bond0.5 is stuck running dhclient
! bond0 is stuck waiting for bond0.5 to finish
! eno1 is stuck waiting for bond0 to create the ifenslave.bond0 file
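The three "stuck" lines above form a cycle in a wait-for graph. A small sketch of that reading, with the edges taken from the description (the edge from bond0.5 to eno1 stands for "dhclient needs eno1 attached to the bond"; the names WAITS_FOR and find_cycle are illustrative only):

```python
# Who is waiting on whom when bond0.5 wins the lock race.
WAITS_FOR = {
    "bond0.5": ["eno1"],   # dhclient needs a live link, i.e. eno1 in the bond
    "eno1": ["bond0"],     # polls for the ifenslave.bond0 file bond0 creates
    "bond0": ["bond0.5"],  # blocked on the ifstate.bond0 lock
}

# Who is waiting on whom when bond0 wins: nothing waits on bond0.5.
BOND0_FIRST = {
    "bond0.5": ["bond0"],
    "eno1": ["bond0"],
    "bond0": [],
}


def find_cycle(graph):
    """Depth-first search; return one cycle as a list of nodes, or None."""
    WHITE, GREY, BLACK = 0, 1, 2
    state = {}
    stack = []

    def visit(node):
        state[node] = GREY
        stack.append(node)
        for nxt in graph.get(node, []):
            colour = state.get(nxt, WHITE)
            if colour == GREY:
                # nxt is still on the stack: we walked back into it.
                return stack[stack.index(nxt):] + [nxt]
            if colour == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        state[node] = BLACK
        return None

    for node in graph:
        if state.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


if __name__ == "__main__":
    print("bond0.5 first:", find_cycle(WAITS_FOR))
    print("bond0 first:  ", find_cycle(BOND0_FIRST))
```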
I believe ifup should start interfaces (that share lock files) in dependency order. The most basic interface must be awarded the lock before its dependants. In this case:
1 eno1
2 bond0
3 bond0.5
but never:
1 eno1
2 bond0.5
3 bond0
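The ordering asked for above is a topological sort. A minimal sketch, assuming the dependencies can be read from bond-slaves and vlan-raw-device in the example configuration (written with bond0.5 throughout); this illustrates the request and is not how ifup works today. It needs Python 3.9 or later for graphlib, and parse_dependencies and start_order are hypothetical names.

```python
from graphlib import TopologicalSorter

# Trimmed copy of the example /etc/network/interfaces from the report.
INTERFACES = """\
auto eno1 bond0 bond0.5

iface eno1 inet manual
    bond-master bond0

iface bond0 inet manual
    bond-slaves eno1

iface bond0.5 inet dhcp
    vlan-raw-device bond0
"""


def parse_dependencies(text):
    """Map each iface to the set of interfaces that must start before it."""
    deps = {}
    current = None
    for raw in text.splitlines():
        words = raw.split()
        if not words or words[0].startswith("#"):
            continue
        if words[0] == "iface":
            current = words[1]
            deps.setdefault(current, set())
        elif current and words[0] == "bond-slaves":
            # Assumption: the bond comes after its slaves, matching the
            # order given in the report; the literal "none" means no slaves.
            deps[current].update(w for w in words[1:] if w != "none")
        elif current and words[0] == "vlan-raw-device":
            deps[current].add(words[1])
        # bond-master is the same relation seen from the slave; ignored here.
    return deps


def start_order(text):
    """Return interface names with every dependency before its dependants."""
    return list(TopologicalSorter(parse_dependencies(text)).static_order())


if __name__ == "__main__":
    for position, iface in enumerate(start_order(INTERFACES), start=1):
        print(position, iface)
```

TopologicalSorter raises graphlib.CycleError if the configuration describes a dependency loop, which would be a reasonable point to refuse to start rather than hang.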
As a workaround, in /etc/network/interfaces:
-auto eno1 bond0 bond0.5
+auto eno1 bond0
+allow-bond bond0.5
And also in /lib/systemd/system/networking.service
ExecStart=/sbin/ifup -a --read-environment
+ExecStart=/sbin/ifup -a --allow=bond --read-environment
ExecStop=/sbin/ifdown -a --read-environment
Then run:
systemctl daemon-reload
This causes all "auto" interfaces to start then, when they've
completed, all allow-bond interfaces to start.
ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: ifupdown 0.8.10ubuntu1.1 [modified: lib/systemd/system/networking.service]
ProcVersionSignature: Ubuntu 4.4.0-45.66-generic 4.4.21
Uname: Linux 4.4.0-45-generic x86_64
ApportVersion: 2.20.1-0ubuntu2.1
Architecture: amd64
Date: Wed Oct 26 06:32:57 2016
InstallationDate: Installed on 2016-10-24 (1 days ago)
InstallationMedia: Ubuntu-Server 16.04.1 LTS "Xenial Xerus" - Release amd64 (20160719)
SourcePackage: ifupdown
UpgradeStatus: No upgrade log present (probably fresh install)
modified.conffile..etc.init.networking.conf: [modified]
mtime.conffile..etc.init.networking.conf: 2016-10-26T04:52:05.750927
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ifupdown/+bug/1636708/+subscriptions