[Bug 1668410] Re: Infinite loop trying to delete deleted HA router
Hua Zhang
joshua.zhang at canonical.com
Wed Aug 30 02:56:31 UTC 2017
** Description changed:
- Latest Mitaka code, L3 HA
- After running rally create_and_delete_routers (concurrency 100 and times 100, or more), the neutron l3 agent logs on the nodes fill up (a new timestamp every .003 seconds) with traces such as:
- http://paste.openstack.org/show/599851/
- which causes the cluster to fail once the log partition fills up.
+ [Impact]
+
+ When an HA router is deleted, the L3 agent log file fills up. Full log:
+ http://paste.ubuntu.com/25429257/
+
+ The error 'Error while deleting router
+ c0dab368-5ac8-4996-88c9-f5d345a774a6' occurred 3343386 times, logged by
+ _safe_router_removed() [1]:
+
+ $ grep -r 'Error while deleting router c0dab368-5ac8-4996-88c9-f5d345a774a6' |wc -l
+ 3343386
+
+ _safe_router_removed() is invoked at L488 [2]. If it fails, it returns
+ False, and self._resync_router(update) [3] then causes the same update
+ to be processed again, so _safe_router_removed() runs over and over.
+ This is why the log contains so many 'Error while deleting router
+ XXXXX' errors.
+
+ [1] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L361
+ [2] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L488
+ [3] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L457
+
+ [Test Case]
+
+ The cause is a race condition between neutron server and the L3 agent:
+ after neutron server deletes the HA interfaces, the L3 agent may sync an
+ HA router without HA interface info (it is enough to trigger L708 [1]
+ after the HA interfaces are deleted and before the HA router is
+ deleted). If the HA router is deleted at this point, the problem occurs.
+ The test case is therefore:
+
+ 1, Create ha_router
+
+ neutron router-create harouter --ha=True
+
+ 2, Delete ports associated with ha_router before deleting ha_router
+
+ neutron router-port-list harouter |grep 'HA port' |awk '{print $2}' |xargs -l neutron port-delete
+ neutron router-port-list harouter
+
+ 3, Update ha_router to make the l3-agent store ha_router info without
+ ha_port in self.router_info
+
+ neutron router-update harouter --description=test
+
+ 4, Delete ha_router this time
+
+ neutron router-delete harouter
+
+ [1] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/db/l3_hamode_db.py#L708
+
+ [Regression Potential]
+
+ The fix [1] no longer returns an ha_router that is missing its
+ ha_ports, so L488 [2] no longer gets the chance to call
+ _safe_router_removed() for such an ha_router. The patch therefore fixes
+ the root cause, and no regression is expected.
+
+ The fix is already in the mitaka-eol branch, but the neutron-server
+ mitaka package is based on neutron-8.4.0, so it needs to be backported
+ to xenial and mitaka.
+
+ $ git tag --contains 8c77ee6b20dd38cc0246e854711cb91cffe3a069
+ mitaka-eol
+
+ [1] https://review.openstack.org/#/c/440799/2/neutron/db/l3_hamode_db.py
+ [2] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L488
** Summary changed:
- Infinite loop trying to delete deleted HA router
+ [SRU] Infinite loop trying to delete deleted HA router
** Description changed:
[Impact]
When an HA router is deleted, the L3 agent log file fills up. Full log:
http://paste.ubuntu.com/25429257/
The error 'Error while deleting router
c0dab368-5ac8-4996-88c9-f5d345a774a6' occurred 3343386 times, logged by
_safe_router_removed() [1]:
$ grep -r 'Error while deleting router c0dab368-5ac8-4996-88c9-f5d345a774a6' |wc -l
3343386
_safe_router_removed() is invoked at L488 [2]. If it fails, it returns
False, and self._resync_router(update) [3] then causes the same update
to be processed again, so _safe_router_removed() runs over and over.
This is why the log contains so many 'Error while deleting router
XXXXX' errors.
[1] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L361
[2] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L488
[3] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L457
[Test Case]
The cause is a race condition between neutron server and the L3 agent:
after neutron server deletes the HA interfaces, the L3 agent may sync an
HA router without HA interface info (it is enough to trigger L708 [1]
after the HA interfaces are deleted and before the HA router is
deleted). If the HA router is deleted at this point, the problem occurs.
The test case is therefore:
1, Create ha_router
neutron router-create harouter --ha=True
2, Delete ports associated with ha_router before deleting ha_router
neutron router-port-list harouter |grep 'HA port' |awk '{print $2}' |xargs -l neutron port-delete
neutron router-port-list harouter
3, Update ha_router to make the l3-agent store ha_router info without
ha_port in self.router_info
neutron router-update harouter --description=test
4, Delete ha_router this time
neutron router-delete harouter
[1] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/db/l3_hamode_db.py#L708
[Regression Potential]
- The fix [1] no longer returns an ha_router that is missing its
- ha_ports, so L488 [2] no longer gets the chance to call
+ The fix [1] for neutron-server no longer returns an ha_router that is
+ missing its ha_ports, so L488 [2] no longer gets the chance to call
_safe_router_removed() for such an ha_router. The patch therefore fixes
the root cause, and no regression is expected.
The fix is already in the mitaka-eol branch, but the neutron-server
mitaka package is based on neutron-8.4.0, so it needs to be backported
to xenial and mitaka.
$ git tag --contains 8c77ee6b20dd38cc0246e854711cb91cffe3a069
mitaka-eol
[1] https://review.openstack.org/#/c/440799/2/neutron/db/l3_hamode_db.py
[2] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L488
** Tags added: sts sts-sru-needed
** Patch added: "mitaka.debdiff"
https://bugs.launchpad.net/neutron/+bug/1668410/+attachment/4941145/+files/mitaka.debdiff
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to neutron in Ubuntu.
https://bugs.launchpad.net/bugs/1668410
Title:
[SRU] Infinite loop trying to delete deleted HA router
Status in neutron:
In Progress
Status in OpenStack Security Advisory:
Won't Fix
Status in neutron package in Ubuntu:
Triaged
Bug description:
[Impact]
When an HA router is deleted, the L3 agent log file fills up. Full log:
http://paste.ubuntu.com/25429257/
The error 'Error while deleting router
c0dab368-5ac8-4996-88c9-f5d345a774a6' occurred 3343386 times, logged by
_safe_router_removed() [1]:
$ grep -r 'Error while deleting router c0dab368-5ac8-4996-88c9-f5d345a774a6' |wc -l
3343386
_safe_router_removed() is invoked at L488 [2]. If it fails, it returns
False, and self._resync_router(update) [3] then causes the same update
to be processed again, so _safe_router_removed() runs over and over.
This is why the log contains so many 'Error while deleting router
XXXXX' errors.
[1] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L361
[2] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L488
[3] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L457
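The retry behaviour described above can be modelled with a short Python sketch. It is a simplified illustration, not the agent's code: process_updates, always_fails, the log list and the max_iterations cap are hypothetical names introduced here, and the cap exists only so that the sketch terminates.

```python
from collections import deque


def safe_router_removed(router_id, remove_router, log):
    """Try to remove a router; log the error and return False on failure."""
    try:
        remove_router(router_id)
    except Exception:
        log.append("Error while deleting router %s" % router_id)
        return False
    return True


def process_updates(updates, remove_router, max_iterations):
    """Process queued delete updates, re-queueing any that fail.

    max_iterations exists only so this sketch terminates; the report
    describes no such limit in the agent.
    """
    queue = deque(updates)
    log = []
    iterations = 0
    while queue and iterations < max_iterations:
        router_id = queue.popleft()
        iterations += 1
        if not safe_router_removed(router_id, remove_router, log):
            queue.append(router_id)  # stand-in for _resync_router(update)
    return log


def always_fails(router_id):
    # Hypothetical failure: the router state lacks the HA port it expects.
    raise RuntimeError("router %s has no HA port" % router_id)
```

With a removal that can never succeed, every pass adds one more error line, so the number of log entries is bounded only by the cap. This matches the unbounded log growth reported above.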
[Test Case]
The cause is a race condition between neutron server and the L3 agent:
after neutron server deletes the HA interfaces, the L3 agent may sync an
HA router without HA interface info (it is enough to trigger L708 [1]
after the HA interfaces are deleted and before the HA router is
deleted). If the HA router is deleted at this point, the problem occurs.
The test case is therefore:
1, Create ha_router
neutron router-create harouter --ha=True
2, Delete ports associated with ha_router before deleting ha_router
neutron router-port-list harouter |grep 'HA port' |awk '{print $2}' |xargs -l neutron port-delete
neutron router-port-list harouter
3, Update ha_router to make the l3-agent store ha_router info without
ha_port in self.router_info
neutron router-update harouter --description=test
4, Delete ha_router this time
neutron router-delete harouter
[1] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/db/l3_hamode_db.py#L708
[Regression Potential]
The fix [1] for neutron-server no longer returns an ha_router that is
missing its ha_ports, so L488 [2] no longer gets the chance to call
_safe_router_removed() for such an ha_router. The patch therefore fixes
the root cause, and no regression is expected.
The fix is already in the mitaka-eol branch, but the neutron-server
mitaka package is based on neutron-8.4.0, so it needs to be backported
to xenial and mitaka.
$ git tag --contains 8c77ee6b20dd38cc0246e854711cb91cffe3a069
mitaka-eol
[1] https://review.openstack.org/#/c/440799/2/neutron/db/l3_hamode_db.py
[2] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L488
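The effect of the server-side fix described under [Regression Potential] can be illustrated with a minimal Python sketch. It does not reproduce the patch: routers_ready_for_sync and the dictionary keys "ha" and "ha_port" are hypothetical names chosen for this example, and the real data layout in l3_hamode_db.py may differ.

```python
def routers_ready_for_sync(routers):
    """Return only the routers that are safe to hand to the L3 agent.

    An HA router whose HA port is missing is skipped, so the agent never
    stores an incomplete HA router. The keys "ha" and "ha_port" are
    assumptions made for this sketch.
    """
    ready = []
    for router in routers:
        if router.get("ha") and not router.get("ha_port"):
            continue  # HA router without its HA port: do not return it
        ready.append(router)
    return ready


# Hypothetical sync data: one incomplete HA router, one complete HA
# router, and one non-HA router.
example_routers = [
    {"id": "a", "ha": True, "ha_port": None},
    {"id": "b", "ha": True, "ha_port": {"id": "p1"}},
    {"id": "c", "ha": False},
]
```

Under these assumptions, calling routers_ready_for_sync(example_routers) drops router "a" and keeps "b" and "c", so non-HA routers are unaffected by the filter.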
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1668410/+subscriptions