[Bug 1453264] Re: iptables_manager can run very slowly when a large number of security group rules are present

Billy Olsen billy.olsen at canonical.com
Mon Aug 29 22:13:26 UTC 2016


Uploading a debdiff based on what is currently available in
trusty-proposed, since that has been verified and is pending release.

** Description changed:

+ [Impact]
+ 
  We have customers that typically add a few hundred security group rules
  or more.  We also typically run 30+ VMs per compute node.  When about
  10+ VMs with a large SG set all get scheduled to the same node, the L2
  agent (OVS) can spend many minutes in the iptables_manager.apply() code,
  so much so that by the time all the rules are updated, the VM has
  already tried DHCP and failed, leaving it in an unusable state.
  
  While there have been some patches that tried to address this in Juno
  and Kilo, they've either not helped as much as necessary, or broken SGs
  completely due to re-ordering of the iptables rules.
  
  I've been able to show some pretty bad scaling with just a handful of
  VMs running in devstack based on today's code (May 8th, 2015) from
  upstream OpenStack.
+ 
+ 
+ [Test Case]
  
  Here's what I tested:
  
  1. I created a security group with 1000 TCP port rules (you could
  alternatively have a smaller number of rules and more VMs, but it's
  quicker this way)
  
  2. I booted VMs, specifying both the default and "large" SGs, and timed
  from the moment Neutron "learned" about the port until it completed
  its work
  
  3. I got a :( pretty quickly
  
  And here's some data:
  
  1-3 VM - didn't time, less than 20 seconds
  4th VM - 0:36
  5th VM - 0:53
  6th VM - 1:11
  7th VM - 1:25
  8th VM - 1:48
  9th VM - 2:14
  
  While it's busy adding the rules, the OVS agent is consuming pretty
  close to 100% of a CPU for most of this time (from top):
  
-   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
+   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  25767 stack     20   0  157936  76572   4416 R  89.2  0.5  50:14.28 python
  
  And this is with only ~10K rules at this point!  When we start crossing
  the 20K point, VM boot failures start to happen.
  
  I'm filing this bug since we need to take a closer look at this in
  Liberty and fix it; it's been this way since Havana and needs some TLC.
  
  I've attached a simple script I've used to recreate this, and will start
  taking a look at options here.
+ 
+ 
+ [Regression Potential]
+ 
+ Minimal, since this fix has been carried in the upstream stable
+ branches for several releases now (Kilo, Liberty, Mitaka).

** Also affects: neutron (Ubuntu)
   Importance: Undecided
       Status: New

** Patch added: "trusty patch based on -proposed"
   https://bugs.launchpad.net/ubuntu/+source/neutron/+bug/1453264/+attachment/4730270/+files/lp1453264.debdiff

** Also affects: cloud-archive
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to neutron in Ubuntu.
https://bugs.launchpad.net/bugs/1453264

Title:
  iptables_manager can run very slowly when a large number of security
  group rules are present

Status in Ubuntu Cloud Archive:
  New
Status in neutron:
  Fix Released
Status in neutron kilo series:
  Fix Released
Status in neutron package in Ubuntu:
  New

Bug description:
  [Impact]

  We have customers that typically add a few hundred security group
  rules or more.  We also typically run 30+ VMs per compute node.  When
  about 10+ VMs with a large SG set all get scheduled to the same node,
  the L2 agent (OVS) can spend many minutes in the
  iptables_manager.apply() code, so much so that by the time all the
  rules are updated, the VM has already tried DHCP and failed, leaving
  it in an unusable state.
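
  For context, apply() roughly regenerates the agent's full rule set and
  feeds it through iptables-save/iptables-restore, so the cost grows with
  the total rule count.  A minimal, hypothetical sketch for timing just
  the raw iptables-restore parse/construct step (chain name, port range
  and rule count are placeholders, not from the attached script;
  typically needs root) could look like:

  import subprocess
  import time

  N_RULES = 10000

  # Build an iptables-restore payload with one single-port TCP rule per
  # port in a throwaway user chain.
  lines = ['*filter', ':scale-test-chain - [0:0]']
  for port in range(1024, 1024 + N_RULES):
      lines.append('-A scale-test-chain -p tcp -m tcp --dport %d -j RETURN'
                   % port)
  lines.append('COMMIT')
  payload = ('\n'.join(lines) + '\n').encode()

  start = time.time()
  # --test only parses and constructs the ruleset; nothing is committed
  # to the kernel, so this isolates the restore-side parsing cost from
  # the agent's Python-side work.
  proc = subprocess.Popen(['iptables-restore', '--test'],
                          stdin=subprocess.PIPE)
  proc.communicate(payload)
  print('iptables-restore --test over %d rules: %.2fs'
        % (N_RULES, time.time() - start))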

  While there have been some patches that tried to address this in Juno
  and Kilo, they've either not helped as much as necessary, or broken
  SGs completely due to re-ordering of the iptables rules.

  I've been able to show some pretty bad scaling with just a handful of
  VMs running in devstack based on today's code (May 8th, 2015) from
  upstream OpenStack.

  
  [Test Case]

  Here's what I tested:

  1. I created a security group with 1000 TCP port rules (you could
  alternatively have a smaller number of rules and more VMs, but it's
  quicker this way; see the sketch after this list)

  2. I booted VMs, specifying both the default and "large" SGs, and
  timed from the moment Neutron "learned" about the port until it
  completed its work

  3. I got a :( pretty quickly
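
  A rough, hypothetical sketch of step 1 using python-neutronclient
  (credentials, group name and port range are placeholders; the attached
  script may differ):

  from neutronclient.v2_0 import client

  # Placeholder devstack-style admin credentials.
  neutron = client.Client(username='admin',
                          password='secret',
                          tenant_name='admin',
                          auth_url='http://127.0.0.1:5000/v2.0')

  # The "large" security group from step 1.
  sg = neutron.create_security_group(
      {'security_group': {'name': 'large-sg',
                          'description': 'SG scale test'}})
  sg_id = sg['security_group']['id']

  # One ingress rule per TCP port, 1000 rules in total.
  for port in range(10000, 11000):
      neutron.create_security_group_rule(
          {'security_group_rule': {'security_group_id': sg_id,
                                   'direction': 'ingress',
                                   'ethertype': 'IPv4',
                                   'protocol': 'tcp',
                                   'port_range_min': port,
                                   'port_range_max': port}})

  For step 2, one option is to boot each VM with both groups, e.g.
  "nova boot ... --security-groups default,large-sg", and time the OVS
  agent's work from its log.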

  And here's some data:

  1-3 VM - didn't time, less than 20 seconds
  4th VM - 0:36
  5th VM - 0:53
  6th VM - 1:11
  7th VM - 1:25
  8th VM - 1:48
  9th VM - 2:14

  While it's busy adding the rules, the OVS agent is consuming pretty
  close to 100% of a CPU for most of this time (from top):

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  25767 stack     20   0  157936  76572   4416 R  89.2  0.5  50:14.28 python

  And this is with only ~10K rules at this point!  When we start
  crossing the 20K point, VM boot failures start to happen.

  I'm filing this bug since we need to take a closer look at this in
  Liberty and fix it; it's been this way since Havana and needs some
  TLC.

  I've attached a simple script I've used to recreate this, and will
  start taking a look at options here.

  
  [Regression Potential]

  Minimal, since this fix has been carried in the upstream stable
  branches for several releases now (Kilo, Liberty, Mitaka).

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1453264/+subscriptions


