[Bug 1896734] Re: A privsep daemon hangs when debug logging is enabled

Dmitrii Shcherbakov 1896734 at bugs.launchpad.net
Thu Sep 24 19:42:23 UTC 2020


~nikolay.vinogradov and I found the root cause after a long debugging
session.

What we have:

* a bionic-ussuri deployment with the HWE (focal 5.4) kernel;

* 2 processes communicating over a type=STREAM unix socket in blocking mode;
* unix sockets have a maximum buffer size, so the amount of data that can be written before the writer blocks is far from infinite: https://elixir.bootlin.com/linux/v5.4.67/source/include/net/sock.h#L2557
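
To illustrate the point about the buffer limit, here is a standalone sketch (not oslo.privsep code) that fills a SOCK_STREAM unix socket until the kernel refuses to accept more data without the peer draining it; in blocking mode the writer would simply hang at that point:

import socket

a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
a.setblocking(False)          # non-blocking so the demo errors out instead of hanging

sent = 0
chunk = b"x" * 65536
try:
    while True:
        sent += a.send(chunk)
except BlockingIOError:
    # In blocking mode this is where a send()/sendall() would stall until
    # the peer reads from the other end of the socket.
    print("socket buffer filled after ~%d bytes" % sent)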

* the privsep daemon process is the server which responds to a client
(neutron-openvswitch-agent) request "get_devices_with_ip"

   * File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 1364, in get_devices_with_ip
   * File "/usr/lib/python3/dist-packages/oslo_privsep/daemon.py", line 195, in remote_call

* when debug logging is enabled, the privsep daemon sends a log message
back to the caller to be written to the caller's log; that log message
includes the full reply;


* the client has a separate reader thread which reads messages from the unix socket in a loop and sets results on previously created Futures keyed by message id. The original thread that initiated the request blocks until the result associated with its Future appears (see the simplified sketch after the links below);

https://github.com/openstack/oslo.privsep/blob/stable/ussuri/oslo_privsep/comm.py#L120-L124 (thread creation)
https://github.com/openstack/oslo.privsep/blob/stable/ussuri/oslo_privsep/comm.py#L130-L155 (the thread's main function)

https://github.com/openstack/oslo.privsep/blob/stable/ussuri/oslo_privsep/comm.py#L172
(the code that waits for the result of a future to appear).
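
Schematically, each message on the wire is a (msgid, payload) tuple; msgid == None marks an out-of-band message (such as a forwarded log record), while any other msgid resolves a Future that a caller thread is blocked on. A much simplified model of comm.py, using concurrent.futures.Future for brevity (oslo.privsep defines its own Future class), looks roughly like this:

import threading
from concurrent.futures import Future   # oslo.privsep uses its own Future class; this is an analogy

class ClientChannelSketch:
    def __init__(self, reader):
        self.outstanding_msgs = {}
        self.lock = threading.Lock()
        # 'reader' yields already-deserialized (msgid, data) tuples read from the socket
        self._thread = threading.Thread(target=self._reader_main,
                                        args=(reader,), daemon=True)
        self._thread.start()

    def _reader_main(self, reader):
        for msgid, data in reader:
            if msgid is None:
                self.out_of_band(data)          # e.g. a LOG message from the daemon
            else:
                with self.lock:
                    self.outstanding_msgs.pop(msgid).set_result(data)

    def out_of_band(self, data):
        pass                                     # forward daemon log records to the local log

    def send_recv(self, msgid, msg, writer):
        fut = Future()
        with self.lock:
            self.outstanding_msgs[msgid] = fut
        writer.send((msgid, msg))
        return fut.result()                      # blocks until the reader thread sets a result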

* msgpack is used to serialize the data sent over the unix socket;

* the node has hardware offloading enabled and there are many network
interfaces registered with the kernel. The `ip addr show` output alone
has 764 lines:

root@node2:~# ip a s | wc -l
764

* The log message that the privsep daemon attempts to send is 1483120
bytes = 1.414 MiB (not counting the overhead).

https://paste.ubuntu.com/p/Sf2wrcpWvB/
(Pdb) where
  /usr/bin/privsep-helper(10)<module>()
-> sys.exit(helper_main())
  /usr/lib/python3/dist-packag
  /usr/lib/python3/dist-packages/oslo_privsep/daemon.py(170)emit()
-> self.channel.send((None, (Message.LOG, data)))
> /usr/lib/python3/dist-packages/oslo_privsep/comm.py(202)send()
-> self.writer.send(msg)

(Pdb) len(str(msg))
1483120


* The client reader thread tries to receive this message from the unix socket, but msgpack raises an error because the message exceeds its maximum string length:

(Pdb) l
131             import rpdb ; rpdb.set_trace()
132             with self.lock:
133                 self.running = True
134             for msg in reader:
135                 msgid, data = msg
136  ->             if msgid is None:
137                     self.out_of_band(data)
138                 else:
139                     with self.lock:
140                         if msgid not in self.outstanding_msgs:
141                             raise AssertionError("msgid should in "
(Pdb) n
> /usr/lib/python3/dist-packages/oslo_privsep/comm.py(137)_reader_main()
-> self.out_of_band(data)
(Pdb) n
> /usr/lib/python3/dist-packages/oslo_privsep/comm.py(134)_reader_main()
-> for msg in reader:
(Pdb) n
ValueError: ('%s exceeds max_str_len(%s)', 1482520, 1048576)


https://git.launchpad.net/ubuntu/+source/python-msgpack/tree/msgpack/unpack.h?h=ubuntu/focal#n229
        PyErr_Format(PyExc_ValueError, "%u exceeds max_str_len(%zd)", l, u->max_str_len);
https://github.com/msgpack/msgpack-python/blob/v0.6.2/msgpack/unpack.h#L226-L229
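
The limit is easy to reproduce with msgpack alone; the defaults and error message below match msgpack 0.6.x (the version shipped on focal), newer releases use different defaults:

import msgpack

payload = {"log": "x" * 2_000_000}                 # > 1 MiB string, like the forwarded log record
packed = msgpack.packb(payload, use_bin_type=True)

unpacker = msgpack.Unpacker(raw=False)             # 0.6.x default: max_str_len = 1024 * 1024
unpacker.feed(packed)
try:
    next(unpacker)
except ValueError as exc:
    print(exc)                                     # "2000000 exceeds max_str_len(1048576)"

# raising the limit lets the same payload through
unpacker = msgpack.Unpacker(raw=False, max_str_len=4 * 1024 * 1024)
unpacker.feed(packed)
print(len(next(unpacker)["log"]))                  # 2000000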


* The above exception never gets logged by neutron-openvswitch-agent, and the requesting thread remains stuck waiting for a result that will never be set on its Future object.
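
This is why the failure presents as a hang rather than a crash: the reader thread dies on the ValueError without setting a result (or an exception) on any outstanding Future, and remote_call() waits on that Future with no timeout. A standalone analogy, again with concurrent.futures rather than oslo.privsep's own Future class:

import threading
from concurrent.futures import Future

outstanding_msgs = {1: Future()}

def reader_main():
    # Stand-in for ClientChannel._reader_main(): deserialization blows up,
    # the loop exits, and nothing ever resolves the outstanding futures.
    raise ValueError("1482520 exceeds max_str_len(1048576)")

threading.Thread(target=reader_main, daemon=True).start()

# remote_call() effectively does this with no timeout, so the agent thread
# blocks forever; the timeout here is only so the demo terminates.
outstanding_msgs[1].result(timeout=2)              # raises TimeoutError after 2s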

** Also affects: python-oslo.privsep (Ubuntu)
   Importance: Undecided
       Status: New

** Summary changed:

- A privsep daemon hangs when debug logging is enabled
+ A privsep daemon spawned by neutron-openvswitch-agent hangs when debug logging is enabled (large number of registered NICs) - an RPC response is too large for msgpack

** Changed in: charm-neutron-openvswitch
       Status: New => Invalid

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to python-oslo.privsep in Ubuntu.
https://bugs.launchpad.net/bugs/1896734

Title:
  A privsep daemon spawned by neutron-openvswitch-agent hangs when debug
  logging is enabled (large number of registered NICs) - an RPC response
  is too large for msgpack

Status in OpenStack neutron-openvswitch charm:
  Invalid
Status in oslo.privsep:
  New
Status in python-oslo.privsep package in Ubuntu:
  New

Bug description:
  While trying to debug a different issue, I encountered a situation
  where privsep hangs in the process of handling a request from neutron-
  openvswitch-agent when debug logging is enabled (juju debug-log
  neutron-openvswitch=true):

  https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11
  https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12

  The issue gets reproduced reliably in the environment where I
  encountered it on all units. As a result, neutron-openvswitch-agent
  services hang while waiting for a response from the privsep daemon and
  do not progress past basic initialization. They never post any state
  back to the Neutron server and thus are marked dead by it.

  The processes, though, are shown as "active (running)" by systemd, which
  adds to the confusion since they do indeed start successfully from
  systemd's perspective.

  systemctl --no-pager status neutron-openvswitch-agent.service
  ● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent
     Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago
   Main PID: 247772 (/usr/bin/python)
      Tasks: 4 (limit: 9830)
     CGroup: /system.slice/neutron-openvswitch-agent.service
             ├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og
             └─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck

  --------------------------------------------------------

  An strace capture shows that the privsep daemon tries to receive input
  from fd 3, which is the unix socket it uses to communicate with the
  client. However, this is just one thread out of many spawned by the
  privsep daemon, so on its own it is unlikely to point at the root cause
  (there are 65 threads in total, see https://paste.ubuntu.com/p/fbGvN2P8rP/)

  # there is one extra neutron-openvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate)
  root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'
   860690 100000   /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock
   248272 root     /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock
   363905 root     grep --color=auto -P privsep.*openvswitch

  root@node2:~# strace -f -p 248453 2>&1
  [pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
  [pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
  [pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
  [pid 248453] recvfrom(3,

  root@node2:~# lsof -p 248453  | grep 3u
  privsep-h 248453 root    3u  unix 0xffff8e6d8abdec00      0t0 356522977 type=STREAM

  root@node2:~# ss -pax | grep 356522977
  u_str             ESTAB               0                    0                                                         /tmp/tmp2afa3enn/privsep.sock 356522978
                                    * 356522977                              users:(("/usr/bin/python",pid=247567,fd=16))
  u_str             ESTAB               0                    0                                                                                     * 356522977
                                    * 356522978                              users:(("privsep-helper",pid=248453,fd=3))

  root@node2:~# lsof -p 247567  | grep 16u
  /usr/bin/ 247567 neutron   16u     unix 0xffff8e6d8abdb400      0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1896734/+subscriptions
