[Bug 1988457] Re: [SRU] ovsdbapp can time out on raft leadership change
Edward Hope-Morley
1988457 at bugs.launchpad.net
Mon Sep 23 17:58:46 UTC 2024
Focal Yoga verified using [Test Plan] with the following output:
# apt-cache policy python3-ovsdbapp
python3-ovsdbapp:
  Installed: 1.15.1-0ubuntu2.1~cloud0
  Candidate: 1.15.1-0ubuntu2.1~cloud0
  Version table:
 *** 1.15.1-0ubuntu2.1~cloud0 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu focal-proposed/yoga/main amd64 Packages
        100 /var/lib/dpkg/status
     1.1.0-0ubuntu2 500
        500 http://availability-zone-2.clouds.archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
     1.1.0-0ubuntu1 500
        500 http://availability-zone-2.clouds.archive.ubuntu.com/ubuntu focal/main amd64 Packages
** Tags removed: verification-needed verification-yoga-needed
** Tags added: verification-done verification-yoga-done
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1988457
Title:
[SRU] ovsdbapp can time out on raft leadership change
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive yoga series:
Fix Committed
Status in ovsdbapp:
Fix Released
Status in python-ovsdbapp package in Ubuntu:
Fix Released
Status in python-ovsdbapp source package in Jammy:
Fix Released
Bug description:
When raft leadership changes, any leader-only connections will be
disconnected and will need to reconnect to the new leader. When this
happens, the IDL will return a txn status of TRY_AGAIN. The current
code does an exponential backoff with sleep() because TRY_AGAIN can
otherwise be returned thousands of times a second. This sleep also
prevents reconnecting quickly, because idl.run() is not called often
enough, and can lead to timeouts.
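
To illustrate the failure mode described above, here is a minimal
sketch. It is not ovsdbapp code: FakeIdl, commit_with_retries and the
two delay functions are hypothetical names, time is simulated rather
than slept, and the assumption that a reconnect completes after a fixed
number of run() calls is a deliberate simplification. It only shows why
a growing sleep between run() calls can exhaust a transaction timeout
where a short, fixed wait does not.

```python
TRY_AGAIN = "try again"
SUCCESS = "success"


class FakeIdl:
    """Hypothetical stand-in for an IDL (assumption, not the real class).

    It pretends the reconnect to the new leader completes only after
    run() has been called `runs_needed` times.
    """

    def __init__(self, runs_needed):
        self.runs_needed = runs_needed
        self.run_calls = 0

    def run(self):
        self.run_calls += 1

    def commit(self):
        if self.run_calls >= self.runs_needed:
            return SUCCESS
        return TRY_AGAIN


def commit_with_retries(idl, timeout, next_delay):
    """Retry until SUCCESS or until the simulated clock would pass timeout.

    next_delay(attempt) gives the wait before the next idl.run() call.
    Returns a (status, elapsed_seconds) tuple.
    """
    elapsed = 0.0
    attempt = 0
    while True:
        idl.run()
        if idl.commit() == SUCCESS:
            return SUCCESS, elapsed
        delay = next_delay(attempt)
        if elapsed + delay > timeout:
            return "timeout", elapsed
        elapsed += delay  # simulated sleep; idl.run() is not called meanwhile
        attempt += 1


def exponential(attempt, base=0.1):
    """Delay doubles on every TRY_AGAIN."""
    return base * 2 ** attempt


def fixed(attempt, interval=0.1):
    """Short constant delay, so run() keeps being called regularly."""
    return interval


# Same fake reconnect cost (10 run() calls) and same 10 s timeout for both.
print(commit_with_retries(FakeIdl(10), 10.0, exponential)[0])  # timeout
print(commit_with_retries(FakeIdl(10), 10.0, fixed)[0])        # success
```

With the exponential delay the seventh wait alone would exceed what is
left of the 10 second budget, so the transaction times out after only
seven run() calls. The fixed delay still limits how often TRY_AGAIN is
seen, but reaches the tenth run() call in under a second of simulated
time.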
--------------------------------------------------------------------------------
SRU TEMPLATE:
[Impact]
Please see the original bug description. What I can add is that in
production we saw ovsdbapp transactions fail after a timeout, after
which ovsdbapp would end up in a retry sequence in which the
transactions were never retried, so VM tap devices were not deleted
from OVS when a VM was deleted. The result was a build-up of "stale"
tap devices on br-int (visible as "No such device" entries in
ovs-vsctl show).
[Test Plan]
* Deploy OpenStack Jammy (Yoga) with ml2-ovn
* Spawn several VMs
* Trigger many ovn-central db leadership switches by restarting ovn-central units in rotation, leaving enough time between each restart for a new leader to be elected.
* Delete the VMs and create many more while leaders are being re-elected.
* First check that /var/log/nova/nova-compute.log does not contain the "OVSDB transaction returned TRY_AGAIN" message over and over. Then check that ovs-vsctl show does not contain any "stale" ports with messages like the following:
    Port tapa5d45fc6-02
        Interface tapa5d45fc6-02
            error: "could not open network device tapa5d45fc6-02 (No such device)"
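
The two checks in the last step could be scripted along the following
lines. This is only a sketch: find_stale_interfaces and count_try_again
are hypothetical helper names, the regular expression assumes the error
text has exactly the form shown above, and the br-int lines in the
sample are made up for illustration.

```python
import re

# Assumes the error line looks exactly like the example in the test plan.
STALE_RE = re.compile(
    r"could not open network device (\S+) \(No such device\)"
)

TRY_AGAIN_MARKER = "OVSDB transaction returned TRY_AGAIN"


def find_stale_interfaces(show_output):
    """Return device names reported as 'No such device' in ovs-vsctl show."""
    return STALE_RE.findall(show_output)


def count_try_again(log_text):
    """Count log lines that contain the TRY_AGAIN message."""
    return sum(1 for line in log_text.splitlines() if TRY_AGAIN_MARKER in line)


# Sample text: the tap port is from the test plan, br-int is illustrative.
sample_show = """
    Port tapa5d45fc6-02
        Interface tapa5d45fc6-02
            error: "could not open network device tapa5d45fc6-02 (No such device)"
    Port br-int
        Interface br-int
            type: internal
"""

print(find_stale_interfaces(sample_show))  # ['tapa5d45fc6-02']
```

On a compute node the input would come from the real command and log
file, for example subprocess.run(["ovs-vsctl", "show"],
capture_output=True, text=True).stdout and the contents of
/var/log/nova/nova-compute.log. An empty list and a TRY_AGAIN count that
does not keep growing would match the expected result of the test plan.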
[Regression Potential]
This patch is not expected to introduce any regressions.
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1988457/+subscriptions