[Bug 1453264] Re: [SRU] iptables_manager can run very slowly when a large number of security group rules are present

Billy Olsen billy.olsen at canonical.com
Fri Sep 2 17:26:19 UTC 2016


Great, thanks Matt & Corey

-- 
You received this bug notification because you are a member of Ubuntu
Sponsors Team, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1453264

Title:
  [SRU] iptables_manager can run very slowly when a large number of
  security group rules are present

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive icehouse series:
  Fix Committed
Status in neutron:
  Fix Released
Status in neutron kilo series:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Trusty:
  In Progress

Bug description:
  [Impact]

  We have customers that typically add a few hundred security group
  rules or more.  We also typically run 30+ VMs per compute node.  When
  about 10+ VMs with a large SG set all get scheduled to the same node,
  the L2 agent (OVS) can spend many minutes in the
  iptables_manager.apply() code, so much so that by the time all the
  rules are updated, the VM has already tried DHCP and failed, leaving
  it in an unusable state.
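
  A rough way to confirm where the time goes is to time each call to
  apply() before starting the agent.  A minimal sketch (the wrapper
  below is purely illustrative instrumentation, not part of any
  patch):

    import time

    from neutron.agent.linux import iptables_manager

    _orig_apply = iptables_manager.IptablesManager.apply

    def _timed_apply(self):
        # Report how long each full iptables rebuild takes.
        start = time.time()
        result = _orig_apply(self)
        print('iptables_manager.apply() took %.2fs'
              % (time.time() - start))
        return result

    iptables_manager.IptablesManager.apply = _timed_apply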

  While there have been some patches that tried to address this in Juno
  and Kilo, they've either not helped as much as necessary or broken
  SGs completely due to re-ordering of the iptables rules.

  I've been able to show some pretty bad scaling with just a handful of
  VMs running in devstack, based on today's code (May 8th, 2015) from
  upstream OpenStack.

  
  [Test Case]

  Here's what I tested:

  1. I created a security group with 1000 TCP port rules (you could
  alternatively use a smaller number of rules and more VMs, but it's
  quicker this way); a sketch of such a setup script follows this list

  2. I booted VMs, specifying both the default and "large" SGs, and
  timed from the moment Neutron "learned" about the port until it
  completed its work

  3. I got a :( pretty quickly
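
  For reference, a sketch of the setup in step 1 using the neutron
  CLI (this is not the attached script; the SG name and port range
  are arbitrary, and credentials are assumed to already be in the
  environment):

    import subprocess

    SG_NAME = 'large-sg'  # arbitrary name for the "large" SG

    subprocess.check_call(['neutron', 'security-group-create', SG_NAME])

    # One ingress rule per TCP port, 1000 rules in total.
    for port in range(10000, 11000):
        subprocess.check_call([
            'neutron', 'security-group-rule-create',
            '--direction', 'ingress', '--protocol', 'tcp',
            '--port-range-min', str(port),
            '--port-range-max', str(port),
            SG_NAME,
        ])

  The VMs in step 2 can then be booted with both groups, e.g. "nova
  boot ... --security-groups default,large-sg".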

  And here's some data:

  1-3 VM - didn't time, less than 20 seconds
  4th VM - 0:36
  5th VM - 0:53
  6th VM - 1:11
  7th VM - 1:25
  8th VM - 1:48
  9th VM - 2:14

  While it's busy adding the rules, the OVS agent is consuming pretty
  close to 100% of a CPU for most of this time (from top):

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  25767 stack     20   0  157936  76572   4416 R  89.2  0.5  50:14.28 python

  And this is with only ~10K rules at this point!  When we start
  crossing the 20K mark, VM boot failures start to happen.
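
  The rule count on the compute node can be tracked while VMs boot
  with something along these lines (assumes root, and counts
  iptables-save lines as a rough proxy for the number of rules):

    import subprocess

    # Each rule is one line of iptables-save output, plus a small
    # amount of header/chain bookkeeping.
    rules = subprocess.check_output(['iptables-save']).splitlines()
    print('%d lines in iptables-save output' % len(rules))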

  I'm filing this bug since we need to take a closer look at this in
  Liberty and fix it; it's been this way since Havana and needs some
  TLC.

  I've attached a simple script I've used to recreate this, and will
  start taking a look at options here.

  
  [Regression Potential]

  Minimal, since this fix has been running in the upstream stable
  branches for several releases now (Kilo, Liberty, Mitaka).

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1453264/+subscriptions


