[Bug 1894879] Re: frequent crashes when using do-resolve()
Rafael David Tinoco
1894879 at bugs.launchpad.net
Thu Sep 10 18:40:13 UTC 2020
+0003-BUG-CRITICAL-dns-Make-the-do-resolve-action-thread-safe.patch
This is available at 2020/07/23: 2.2.1
+0004-BUG-MEDIUM-dns-Release-answer-items-when-a-DNS-resolution-is-freed.patch
This is available at 2020/07/23: 2.2.1
+0005-BUG-MEDIUM-dns-Dont-yield-in-do-resolve-action-on-a-final.patch
This one is available at 2020/07/31: 2.2.2
So we are good in Groovy.
** Also affects: haproxy (Ubuntu Focal)
Importance: Undecided
Status: New
** Changed in: haproxy (Ubuntu)
Status: New => Fix Released
** Changed in: haproxy (Ubuntu Focal)
Status: New => Triaged
--
You received this bug notification because you are a member of Ubuntu
Sponsors Team, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1894879
Title:
frequent crashes when using do-resolve()
Status in haproxy package in Ubuntu:
Fix Released
Status in haproxy source package in Focal:
Triaged
Bug description:
[Impact]
* using the do-resolve action causes frequent crashes (>15 times/day on
a given machine) and sometimes puts haproxy into a bogus state where DNS
resolution stops working forever without crashing (DoS)
* to address the problem, three upstream fixes released in a later micro
version are backported. Those fixes address problems affecting the
do-resolve action, which was introduced in the 2.0 branch (which is why
only Focal is affected)
[Test case]
Those crashes are due to the do-resolve action not being thread-safe. As
such, the bug is hard to trigger on demand, except under production load.
The proposed fix should be tested for any regression, ideally on an
SMP-enabled (multiple CPUs/cores) machine, as the crashes happen more
easily on those.
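Since the crashes stem from a thread-safety issue, it may be worth
confirming first that the build under test actually runs threaded and that
the machine has several cores, e.g. (haproxy prints its build options with
-vv):

haproxy -vv | grep -i thread
nproc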
haproxy should be configured as in your production environment and put
under load to see whether it still behaves as it should. The more
concurrent traffic the better, especially if the configuration makes use
of the do-resolve action.
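As a rough sketch of generating concurrent TCP traffic (the hostname and
port below are placeholders for your actual frontend, and netcat option
syntax may differ slightly between BSD and GNU variants):

for i in $(seq 1 200); do
    (printf 'hello\r\n' | nc -w 1 haproxy.example.com 8080) &
done
wait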
Any crash will be visible in the journal, so keep an eye on
"journalctl -fu haproxy.service" during the tests. Abnormal exits should
be visible with:
journalctl -u haproxy --grep ALERT | grep -F 'exited with code '
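A crash shows up there as an ALERT line like the ones quoted in the
original description below, for example:

Sep 07 18:14:57 foo haproxy[16831]: [ALERT] 250/181457 (16831) : Current worker #1 (16839) exited with code 139 (Segmentation fault)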
With the proposed fix, there should be none even when under heavy
concurrent traffic.
[Regression Potential]
The backported fixes address locking and memory management issues, so it's
possible they could introduce new locking problems or memory leaks. That
said, they only touch code related to the DNS/do-resolve action, which is
already known to trigger frequent crashes.
The backported fixes come from a later micro version (2.0.16, released on
2020-07-17), so they have already seen real-world production usage.
Upstream has released another micro version since then and did not revert
or rework the DNS/do-resolve action.
Furthermore, the patch was tested on a production machine that used to
crash >15 times per day. 48 hours after deploying the patch, we have seen
no crash and no sign of regression either.
[Original description]
haproxy 2.0.13-2 keeps crashing for us:
# journalctl --since yesterday -u haproxy --grep ALERT | grep -F 'exited with code '
Sep 07 18:14:57 foo haproxy[16831]: [ALERT] 250/181457 (16831) : Current worker #1 (16839) exited with code 139 (Segmentation fault)
Sep 07 19:45:23 foo haproxy[16876]: [ALERT] 250/194523 (16876) : Current worker #1 (16877) exited with code 139 (Segmentation fault)
Sep 07 19:49:01 foo haproxy[16916]: [ALERT] 250/194901 (16916) : Current worker #1 (16919) exited with code 139 (Segmentation fault)
Sep 07 19:49:02 foo haproxy[16939]: [ALERT] 250/194902 (16939) : Current worker #1 (16942) exited with code 139 (Segmentation fault)
Sep 07 19:49:03 foo haproxy[16953]: [ALERT] 250/194903 (16953) : Current worker #1 (16955) exited with code 139 (Segmentation fault)
Sep 07 19:49:37 foo haproxy[16964]: [ALERT] 250/194937 (16964) : Current worker #1 (16965) exited with code 139 (Segmentation fault)
Sep 07 23:41:13 foo haproxy[16982]: [ALERT] 250/234113 (16982) : Current worker #1 (16984) exited with code 139 (Segmentation fault)
Sep 07 23:41:14 foo haproxy[17076]: [ALERT] 250/234114 (17076) : Current worker #1 (17077) exited with code 139 (Segmentation fault)
Sep 07 23:43:20 foo haproxy[17090]: [ALERT] 250/234320 (17090) : Current worker #1 (17093) exited with code 139 (Segmentation fault)
Sep 07 23:43:50 foo haproxy[17113]: [ALERT] 250/234350 (17113) : Current worker #1 (17116) exited with code 139 (Segmentation fault)
Sep 07 23:43:51 foo haproxy[17134]: [ALERT] 250/234351 (17134) : Current worker #1 (17135) exited with code 139 (Segmentation fault)
Sep 07 23:44:44 foo haproxy[17146]: [ALERT] 250/234444 (17146) : Current worker #1 (17147) exited with code 139 (Segmentation fault)
Sep 08 00:18:54 foo haproxy[17164]: [ALERT] 251/001854 (17164) : Current worker #1 (17166) exited with code 134 (Aborted)
Sep 08 00:27:51 foo haproxy[17263]: [ALERT] 251/002751 (17263) : Current worker #1 (17266) exited with code 139 (Segmentation fault)
Sep 08 00:30:36 foo haproxy[17286]: [ALERT] 251/003036 (17286) : Current worker #1 (17289) exited with code 134 (Aborted)
Sep 08 00:37:01 foo haproxy[17307]: [ALERT] 251/003701 (17307) : Current worker #1 (17310) exited with code 139 (Segmentation fault)
Sep 08 00:40:31 foo haproxy[17331]: [ALERT] 251/004031 (17331) : Current worker #1 (17334) exited with code 139 (Segmentation fault)
Sep 08 00:41:14 foo haproxy[17650]: [ALERT] 251/004114 (17650) : Current worker #1 (17651) exited with code 139 (Segmentation fault)
Sep 08 00:41:59 foo haproxy[17669]: [ALERT] 251/004159 (17669) : Current worker #1 (17672) exited with code 139 (Segmentation fault)
The server in question uses a config that looks like this:
global
    maxconn 50000
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    maxconn 50000
    log global
    mode tcp
    option tcplog
    # dontlognull won't log sessions where the DNS resolution failed
    #option dontlognull
    timeout connect 5s
    timeout client 15s
    timeout server 15s

resolvers mydns
    nameserver local 127.0.0.1:53
    accepted_payload_size 1400
    timeout resolve 1s
    timeout retry 1s
    hold other 30s
    hold refused 30s
    hold nx 30s
    hold timeout 30s
    hold valid 10s
    hold obsolete 30s

frontend foo
    ...
    # dns lookup
    tcp-request content do-resolve(txn.remote_ip,mydns,ipv4) var(txn.remote_host)
    tcp-request content capture var(txn.remote_ip) len 40
    # XXX: the remaining rejections happen in the backend to provide better logging
    # reject connection on DNS resolution error
    use_backend be_dns_failed unless { var(txn.remote_ip) -m found }
    ...
    # at this point, we should let the connection through
    default_backend be_allowed
When reaching out to haproxy folks in #haproxy,
https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=ef131ae was
mentioned as a potential fix.
https://www.haproxy.org/bugs/bugs-2.0.13.html shows 3 commits with
"dns":
* https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=ef131ae
* https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=39eb766
* https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=74d704f
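These commits can also be inspected locally; assuming a clone URL matching
the gitweb instance above, something like:

git clone http://git.haproxy.org/git/haproxy-2.0.git
cd haproxy-2.0
git show ef131ae 39eb766 74d704f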
It would be great to have those (at least ef131ae) SRU'ed to Focal
(Bionic isn't affected as it runs 1.8).
Additional information:
# lsb_release -rd
Description:    Ubuntu 20.04.1 LTS
Release:        20.04

# apt-cache policy haproxy
haproxy:
  Installed: 2.0.13-2
  Candidate: 2.0.13-2
  Version table:
 *** 2.0.13-2 500
        500 http://archive.ubuntu.com/ubuntu focal/main amd64 Packages
        100 /var/lib/dpkg/status
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/haproxy/+bug/1894879/+subscriptions