[Bug 1961046] [NEW] Duplicate IP address on LSPs in the same LS lead to unpredictable flow output
Frode Nordahl
1961046 at bugs.launchpad.net
Wed Feb 16 10:31:16 UTC 2022
Public bug reported:
In a busy CI cloud we ran into an issue that took us days to get to the
bottom of.
Essentially what happens is that the CMS is sometimes unable to complete
deletion of a LSP (being worked on separately [0]).
Subsequently a new LSP is created using the same IP address but a
different MAC address. There exists duplicate IP checks in the ovn-nbctl
administration tool [1], but the current database schema does not apply
any constraints on a CMS adding duplicates and despite [1] citing ovn-
northd should log this, it does not appear to do so.
When a system is in the above described state the flow output on
hypervisors is unpredictable, and this leads to instances not being able
to reach the router nor the outside world.
What happens inside the ovn-controller is that the function looking up
installed flows [2] only does a table, priority and miniflow match, and
does not consider the possibility of other collisions.
Subsequently an attempt to link an desired flow to an installed flow
unexpectedly fails, and the flow is just not processed at all [3].
At the very least I think we need to investigate why this situation is not logged in ovn-northd, but I don't think just logging this is enough.
Is there something we can change in the database model that would
prevent a CMS from getting itself into this situation in the first
place?
Should we consider also not generating the lflows?
Example lflows:
table=15(lr_in_arp_resolve ), priority=500 , match=(ip4.mcast || ip6.mcast), action=(next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "R1_R2" && reg0== 20.0.0.2), action=(eth.dst = 00:00:00:02:03:05; next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "R1_R2" && xxreg0 == fe80::200:ff:fe02:305), action=(eth.dst = 00:00:00:02:03:05; next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:01; next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:02; next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:03; next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:04; next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:05; next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:06; next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:07; next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:08; next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:09; next;)
table=15(lr_in_arp_resolve ), priority=0 , match=(ip4), action=(get_arp(outport, reg0); next;)
table=15(lr_in_arp_resolve ), priority=0 , match=(ip6), action=(get_nd(outport, xxreg0); next;)
Resulting flow table:
cookie=0xe683b51c, duration=0.128s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=500,ipv6,metadata=0x2,dl_dst=33:33:00:00:00:00/ff:ff:00:00:00:00,ipv6_dst=ff00::/8 actions=resubmit(,24)
cookie=0xe683b51c, duration=0.128s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=500,ipv6,metadata=0x1,dl_dst=33:33:00:00:00:00/ff:ff:00:00:00:00,ipv6_dst=ff00::/8 actions=resubmit(,24)
cookie=0xe683b51c, duration=0.128s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=500,ip,metadata=0x2,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00,nw_dst=224.0.0.0/4 actions=resubmit(,24)
cookie=0xe683b51c, duration=0.128s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=500,ip,metadata=0x1,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00,nw_dst=224.0.0.0/4 actions=resubmit(,24)
cookie=0xed4bbf00, duration=0.120s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=100,reg0=0xfe800000,reg1=0,reg2=0x20000ff,reg3=0xfe020304,reg15=0x2,metadata=0x2 actions=set_field:00:00:00:02:03:04->eth_dst,resubmit(,24)
cookie=0x601769fe, duration=0.114s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=100,reg0=0xfe800000,reg1=0,reg2=0x20000ff,reg3=0xfe020305,reg15=0x2,metadata=0x1 actions=set_field:00:00:00:02:03:05->eth_dst,resubmit(,24)
cookie=0x92b4073a, duration=0.118s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=100,reg0=0x14000001,reg15=0x2,metadata=0x2 actions=set_field:00:00:00:02:03:04->eth_dst,resubmit(,24)
cookie=0x87dacd46, duration=0.118s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=100,reg0=0x14000002,reg15=0x2,metadata=0x1 actions=set_field:00:00:00:02:03:05->eth_dst,resubmit(,24)
cookie=0xa7d4ecf1, duration=0.113s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=100,reg0=0xc0a8010b,reg15=0x1,metadata=0x1 actions=set_field:f0:00:00:01:01:04->eth_dst,resubmit(,24)
cookie=0xb595291d, duration=0.128s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=0,ip,metadata=0x2 actions=push:NXM_NX_REG0[],push:NXM_NX_XXREG0[96..127],pop:NXM_NX_REG0[],set_field:00:00:00:00:00:00->eth_dst,resubmit(,66),pop:NXM_NX_REG0[],resubmit(,24)
cookie=0xb595291d, duration=0.128s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=0,ip,metadata=0x1 actions=push:NXM_NX_REG0[],push:NXM_NX_XXREG0[96..127],pop:NXM_NX_REG0[],set_field:00:00:00:00:00:00->eth_dst,resubmit(,66),pop:NXM_NX_REG0[],resubmit(,24)
cookie=0xb7e18e6b, duration=0.127s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=0,ipv6,metadata=0x2 actions=set_field:00:00:00:00:00:00->eth_dst,resubmit(,66),resubmit(,24)
cookie=0xb7e18e6b, duration=0.127s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=0,ipv6,metadata=0x1 actions=set_field:00:00:00:00:00:00->eth_dst,resubmit(,66),resubmit(,24)
cookie=0xb60abbb8, duration=0.125s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=0,metadata=0x4 actions=resubmit(,24)
cookie=0xb60abbb8, duration=0.125s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=0,metadata=0x3 actions=resubmit(,24)
As you can see the mac address to program the flow for is chosen at
"random".
0: https://review.opendev.org/c/openstack/neutron/+/827834
1: https://github.com/ovn-org/ovn/commit/738295605b44e878ab6ddbcf3068354988a1c82e
2: https://github.com/ovn-org/ovn/blob/ed81be75e8b3b33745eeb9b6ce2686b87ef72cd0/controller/ofctrl.c#L1506-L1523
3: https://github.com/ovn-org/ovn/blob/ed81be75e8b3b33745eeb9b6ce2686b87ef72cd0/controller/ofctrl.c#L1973-L1976
** Affects: ovn (Ubuntu)
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ovn in Ubuntu.
https://bugs.launchpad.net/bugs/1961046
Title:
Duplicate IP address on LSPs in the same LS lead to unpredictable flow
output
Status in ovn package in Ubuntu:
New
Bug description:
In a busy CI cloud we ran into an issue that took us days to get to
the bottom of.
Essentially what happens is that the CMS is sometimes unable to
complete deletion of a LSP (being worked on separately [0]).
Subsequently a new LSP is created using the same IP address but a
different MAC address. There exists duplicate IP checks in the ovn-
nbctl administration tool [1], but the current database schema does
not apply any constraints on a CMS adding duplicates and despite [1]
citing ovn-northd should log this, it does not appear to do so.
When a system is in the above described state the flow output on
hypervisors is unpredictable, and this leads to instances not being
able to reach the router nor the outside world.
What happens inside the ovn-controller is that the function looking up
installed flows [2] only does a table, priority and miniflow match,
and does not consider the possibility of other collisions.
Subsequently an attempt to link an desired flow to an installed flow
unexpectedly fails, and the flow is just not processed at all [3].
At the very least I think we need to investigate why this situation is not logged in ovn-northd, but I don't think just logging this is enough.
Is there something we can change in the database model that would
prevent a CMS from getting itself into this situation in the first
place?
Should we consider also not generating the lflows?
Example lflows:
table=15(lr_in_arp_resolve ), priority=500 , match=(ip4.mcast || ip6.mcast), action=(next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "R1_R2" && reg0== 20.0.0.2), action=(eth.dst = 00:00:00:02:03:05; next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "R1_R2" && xxreg0 == fe80::200:ff:fe02:305), action=(eth.dst = 00:00:00:02:03:05; next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:01; next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:02; next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:03; next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:04; next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:05; next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:06; next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:07; next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:08; next;)
table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:09; next;)
table=15(lr_in_arp_resolve ), priority=0 , match=(ip4), action=(get_arp(outport, reg0); next;)
table=15(lr_in_arp_resolve ), priority=0 , match=(ip6), action=(get_nd(outport, xxreg0); next;)
Resulting flow table:
cookie=0xe683b51c, duration=0.128s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=500,ipv6,metadata=0x2,dl_dst=33:33:00:00:00:00/ff:ff:00:00:00:00,ipv6_dst=ff00::/8 actions=resubmit(,24)
cookie=0xe683b51c, duration=0.128s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=500,ipv6,metadata=0x1,dl_dst=33:33:00:00:00:00/ff:ff:00:00:00:00,ipv6_dst=ff00::/8 actions=resubmit(,24)
cookie=0xe683b51c, duration=0.128s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=500,ip,metadata=0x2,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00,nw_dst=224.0.0.0/4 actions=resubmit(,24)
cookie=0xe683b51c, duration=0.128s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=500,ip,metadata=0x1,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00,nw_dst=224.0.0.0/4 actions=resubmit(,24)
cookie=0xed4bbf00, duration=0.120s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=100,reg0=0xfe800000,reg1=0,reg2=0x20000ff,reg3=0xfe020304,reg15=0x2,metadata=0x2 actions=set_field:00:00:00:02:03:04->eth_dst,resubmit(,24)
cookie=0x601769fe, duration=0.114s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=100,reg0=0xfe800000,reg1=0,reg2=0x20000ff,reg3=0xfe020305,reg15=0x2,metadata=0x1 actions=set_field:00:00:00:02:03:05->eth_dst,resubmit(,24)
cookie=0x92b4073a, duration=0.118s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=100,reg0=0x14000001,reg15=0x2,metadata=0x2 actions=set_field:00:00:00:02:03:04->eth_dst,resubmit(,24)
cookie=0x87dacd46, duration=0.118s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=100,reg0=0x14000002,reg15=0x2,metadata=0x1 actions=set_field:00:00:00:02:03:05->eth_dst,resubmit(,24)
cookie=0xa7d4ecf1, duration=0.113s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=100,reg0=0xc0a8010b,reg15=0x1,metadata=0x1 actions=set_field:f0:00:00:01:01:04->eth_dst,resubmit(,24)
cookie=0xb595291d, duration=0.128s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=0,ip,metadata=0x2 actions=push:NXM_NX_REG0[],push:NXM_NX_XXREG0[96..127],pop:NXM_NX_REG0[],set_field:00:00:00:00:00:00->eth_dst,resubmit(,66),pop:NXM_NX_REG0[],resubmit(,24)
cookie=0xb595291d, duration=0.128s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=0,ip,metadata=0x1 actions=push:NXM_NX_REG0[],push:NXM_NX_XXREG0[96..127],pop:NXM_NX_REG0[],set_field:00:00:00:00:00:00->eth_dst,resubmit(,66),pop:NXM_NX_REG0[],resubmit(,24)
cookie=0xb7e18e6b, duration=0.127s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=0,ipv6,metadata=0x2 actions=set_field:00:00:00:00:00:00->eth_dst,resubmit(,66),resubmit(,24)
cookie=0xb7e18e6b, duration=0.127s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=0,ipv6,metadata=0x1 actions=set_field:00:00:00:00:00:00->eth_dst,resubmit(,66),resubmit(,24)
cookie=0xb60abbb8, duration=0.125s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=0,metadata=0x4 actions=resubmit(,24)
cookie=0xb60abbb8, duration=0.125s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=0,metadata=0x3 actions=resubmit(,24)
As you can see the mac address to program the flow for is chosen at
"random".
0: https://review.opendev.org/c/openstack/neutron/+/827834
1: https://github.com/ovn-org/ovn/commit/738295605b44e878ab6ddbcf3068354988a1c82e
2: https://github.com/ovn-org/ovn/blob/ed81be75e8b3b33745eeb9b6ce2686b87ef72cd0/controller/ofctrl.c#L1506-L1523
3: https://github.com/ovn-org/ovn/blob/ed81be75e8b3b33745eeb9b6ce2686b87ef72cd0/controller/ofctrl.c#L1973-L1976
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1961046/+subscriptions
More information about the Ubuntu-openstack-bugs
mailing list