[Bug 1827264] Re: ovs-vswitchd thread consuming 100% CPU

Magnus Bergman 1827264@bugs.launchpad.net
Mon Nov 11 08:55:25 UTC 2019


We are also seeing this on Bionic with Rocky neutron-openvswitch-agents.
We haven't found any pattern in when it occurs: even after emptying out a
compute node with live migrations and rebooting it, we keep seeing it. The
amount of traffic doesn't seem to be related.

What we are seeing is an increasing number of handler threads using 100%
CPU, which in our setup can add up to several 1000% in total for
ovs-vswitchd. While investigating we found that the handler threads seemed
to spend quite a lot of time in native_queued_spin_lock_slowpath, which
made us suspect that too many threads were competing for the same locks,
so we tried lowering other_config:n-handler-threads. That seemed to fix
the issue at first, but it appears to have been only a temporary fix: we
still see the problem, and it may simply be that changing the
other_config:n-handler-threads value kills the handlers that were stuck at
100% CPU and spins up fresh ones.
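
For reference, this is roughly what we did; the thread count of 4 below is
just an example value for our hosts, not a recommendation:

  # show the current other_config map (n-handler-threads is unset by default)
  ovs-vsctl get Open_vSwitch . other_config
  # lower the handler thread count; ovs-vswitchd recreates its handler
  # threads when this changes, which may be why the change appeared to help
  ovs-vsctl set Open_vSwitch . other_config:n-handler-threads=4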

I find it strange, however, that a freshly rebooted node doesn't show the
same behaviour, which makes me think something is going on during startup
of either OVS or the neutron-openvswitch-agent that causes this under some
circumstances.

Investigation is ongoing; we just wanted to share our observations so far.

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to openvswitch in Ubuntu.
https://bugs.launchpad.net/bugs/1827264

Title:
  ovs-vswitchd thread consuming 100% CPU

Status in openvswitch package in Ubuntu:
  Confirmed

Bug description:
  I have an ovs-vswitchd process consuming 100% CPU in a very lightly
  used OpenStack Rocky cloud running on Bionic. The version in question
  is openvswitch-switch 2.10.0-0ubuntu2~cloud0.

  ovs-vswitchd is running alongside various neutron processes
  (lbaasv2-agent, metadata-agent, l3-agent, dhcp-agent, openvswitch-agent)
  inside an LXC container on a physical host. There is a single neutron
  router, and traffic across the entire environment, including br-tun,
  br-ex, and br-int, barely exceeds 200 KiB/s TX/RX combined.

  The issue appears to have arisen on its own after the host was hard
  rebooted. On restart ovs-vswitchd came up with high load and it has
  not diminished since then.

  
  The thread that is consuming CPU looks like this in ps and shows up with the name 'handler89':

  UID          PID    SPID    PPID  C STIME TTY          TIME CMD
  root        7267    7454       1 99 Apr23 ?        8-03:38:14 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach
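
  For completeness, the per-thread breakdown can be pulled with something
  like this (7267 is the ovs-vswitchd PID from the ps output above):

  # interactive per-thread CPU view; thread names such as 'handler89' appear here
  top -H -p 7267
  # or non-interactively:
  ps -T -p 7267 -o spid,comm,pcpu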

  
  Logs in /var/log/openvswitch/ovs-vswitchd.log show ~14,000 entries like
  this per day, roughly one every six seconds:

  2019-05-01T18:34:30.110Z|237220|poll_loop(handler89)|INFO|Dropped
  1092844 log messages in last 6 seconds (most recently, 0 seconds ago)
  due to excessive rate

  2019-05-01T18:34:30.110Z|237221|poll_loop(handler89)|INFO|wakeup due
  to [POLLIN] on fd 42 (unknown anon_inode:[eventpoll]) at ../lib/dpif-
  netlink.c:2786 (99% CPU usage)
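
  The fd named in the wakeup message can be cross-checked against the
  process's fd table, e.g. (as root, using the PID from above):

  # fd 42 should resolve to the anon_inode:[eventpoll] seen in the log
  ls -l /proc/7267/fd/42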


  Logs in /var/log/neutron/neutron-openvswitch-agent.log show entries
  like this, again about 14,000 sets per day:

  2019-05-01 18:35:31.174 13621 DEBUG
  neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-
  7e354ade-55a3-46ba-adf4-db49a51e5218 - - - - -] Agent rpc_loop -
  iteration:355070 started rpc_loop /usr/lib/python3/dist-
  packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2086

  2019-05-01 18:35:31.177 13621 DEBUG
  neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch
  [req-7e354ade-55a3-46ba-adf4-db49a51e5218 - - - - -] ofctl request
  version=0x4,msg_type=0x12,msg_len=0x38,xid=0x81826d95,OFPFlowStatsRequest(cookie=0,cookie_mask=0,flags=0,match=OFPMatch(oxm_fields={}),out_group=4294967295,out_port=4294967295,table_id=23,type=1)
  result
  [OFPFlowStatsReply(body=[OFPFlowStats(byte_count=0,cookie=13408676230369642694,duration_nsec=724000000,duration_sec=710697,flags=0,hard_timeout=0,idle_timeout=0,instructions=[],length=56,match=OFPMatch(oxm_fields={}),packet_count=0,priority=0,table_id=23)],flags=0,type=1)]
  _send_msg /usr/lib/python3/dist-
  packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py:114

  2019-05-01 18:35:31.179 13621 DEBUG
  neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-
  7e354ade-55a3-46ba-adf4-db49a51e5218 - - - - -] Agent rpc_loop -
  iteration:355070 completed. Processed ports statistics: {'regular':
  {'added': 0, 'updated': 0, 'removed': 0}}. Elapsed:0.005
  loop_count_and_wait /usr/lib/python3/dist-
  packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1875

  
  I'm not sure what other information to add here, but I am happy to
  gather more diagnostic data to help pin this down. I did come across
  https://mail.openvswitch.org/pipermail/ovs-discuss/2018-October/047558.html,
  which concerns the same version as mine, and the symptoms appear to be
  the same.
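
  If it would help, I can also capture ovs-vswitchd's own upcall and event
  counters, e.g. (assuming ovs-appctl can reach the vswitchd control socket
  inside the container):

  # per-handler upcall statistics from the userspace datapath interface
  ovs-appctl upcall/show
  # internal event counters maintained by ovs-vswitchd
  ovs-appctl coverage/show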

  
  Below are various ovs-dpctl/ofctl show reports:

  
  root@juju-df624b-4-lxd-10:~# ovs-dpctl show
  system@ovs-system:
    lookups: hit:223561120 missed:5768546 lost:798
    flows: 131
    masks: hit:2284286371 total:15 hit/pkt:9.96
    port 0: ovs-system (internal)
    port 1: br-ex (internal)
    port 2: eth1
    port 3: gre_sys (gre: packet_type=ptap)
    port 4: br-tun (internal)
    port 5: br-int (internal)
    port 6: tapa062d6f1-40
    port 7: tapf90e3ab6-13
    port 8: tap45ba891c-4c
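
  (Note the lost:798 counter above: in the datapath statistics, "lost"
  counts upcalls the kernel dropped because they could not be delivered to
  userspace, which would be consistent with handler threads spinning
  instead of servicing new upcalls.)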

  
  root@juju-df624b-4-lxd-10:~# ovs-ofctl show br-int
  OFPT_FEATURES_REPLY (xid=0x2): dpid:00003643edb09542
  n_tables:254, n_buffers:0
  capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
  actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
   1(int-br-ex): addr:4a:b4:cc:dd:aa:ac
       config:     0
       state:      0
       speed: 0 Mbps now, 0 Mbps max
   2(patch-tun): addr:7a:36:76:1f:47:6e
       config:     0
       state:      0
       speed: 0 Mbps now, 0 Mbps max
   3(tapa062d6f1-40): addr:92:d4:6c:26:55:0a
       config:     0
       state:      0
       current:    10GB-FD COPPER
       speed: 10000 Mbps now, 0 Mbps max
   4(tapf90e3ab6-13): addr:9e:8b:4f:ae:8f:ba
       config:     0
       state:      0
       current:    10GB-FD COPPER
       speed: 10000 Mbps now, 0 Mbps max
   5(tap45ba891c-4c): addr:76:45:39:c8:d7:e0
       config:     0
       state:      0
       current:    10GB-FD COPPER
       speed: 10000 Mbps now, 0 Mbps max
   LOCAL(br-int): addr:36:43:ed:b0:95:42
       config:     PORT_DOWN
       state:      LINK_DOWN
       speed: 0 Mbps now, 0 Mbps max
  OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0


  root@juju-df624b-4-lxd-10:~# ovs-ofctl show br-tun
  OFPT_FEATURES_REPLY (xid=0x2): dpid:000092c6c8f4a545
  n_tables:254, n_buffers:0
  capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
  actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
   1(patch-int): addr:02:1d:bc:ec:0c:3b
       config:     0
       state:      0
       speed: 0 Mbps now, 0 Mbps max
   2(gre-0a30029b): addr:fe:09:61:11:92:8b
       config:     0
       state:      0
       speed: 0 Mbps now, 0 Mbps max
   3(gre-0a3002a2): addr:a2:64:93:48:e2:82
       config:     0
       state:      0
       speed: 0 Mbps now, 0 Mbps max
   4(gre-0a30029d): addr:12:b1:73:32:19:d4
       config:     0
       state:      0
       speed: 0 Mbps now, 0 Mbps max
   5(gre-0a3002ce): addr:ca:10:51:f2:0f:05
       config:     0
       state:      0
       speed: 0 Mbps now, 0 Mbps max
   6(gre-0a3002a1): addr:de:fe:95:33:d9:67
       config:     0
       state:      0
       speed: 0 Mbps now, 0 Mbps max
   7(gre-0a30029e): addr:5a:fa:ce:15:7b:5c
       config:     0
       state:      0
       speed: 0 Mbps now, 0 Mbps max
   8(gre-0a3002c3): addr:b6:87:66:09:fc:04
       config:     0
       state:      0
       speed: 0 Mbps now, 0 Mbps max
   10(gre-0a3002a0): addr:ce:79:f2:bf:a8:94
       config:     0
       state:      0
       speed: 0 Mbps now, 0 Mbps max
   14(gre-0a30029f): addr:b6:b0:84:7d:9f:aa
       config:     0
       state:      0
       speed: 0 Mbps now, 0 Mbps max
   18(gre-0a30029c): addr:da:14:12:54:1c:ec
       config:     0
       state:      0
       speed: 0 Mbps now, 0 Mbps max
   LOCAL(br-tun): addr:92:c6:c8:f4:a5:45
       config:     PORT_DOWN
       state:      LINK_DOWN
       speed: 0 Mbps now, 0 Mbps max
  OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0


  
  root@juju-df624b-4-lxd-10:~# ovs-ofctl show br-ex
  OFPT_FEATURES_REPLY (xid=0x2): dpid:000000163e9a47a2
  n_tables:254, n_buffers:0
  capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
  actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
   1(eth1): addr:00:16:3e:9a:47:a2
       config:     0
       state:      0
       current:    10GB-FD COPPER
       speed: 10000 Mbps now, 0 Mbps max
   2(phy-br-ex): addr:ce:a8:6a:25:55:0e
       config:     0
       state:      0
       speed: 0 Mbps now, 0 Mbps max
   LOCAL(br-ex): addr:00:16:3e:9a:47:a2
       config:     0
       state:      0
       speed: 0 Mbps now, 0 Mbps max
  OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1827264/+subscriptions


