[Bug 1894879] Re: frequent crashes when using do-resolve()
Rafael David Tinoco
1894879 at bugs.launchpad.net
Thu Sep 10 18:40:13 UTC 2020
+0003-BUG-CRITICAL-dns-Make-the-do-resolve-action-thread-safe.patch
This is available at 2020/07/23: 2.2.1
+0004-BUG-MEDIUM-dns-Release-answer-items-when-a-DNS-resolution-is-freed.patch
This is available at 2020/07/23: 2.2.1
+0005-BUG-MEDIUM-dns-Dont-yield-in-do-resolve-action-on-a-final.patch
This one is available at 2020/07/31: 2.2.2
So we are good in Groovy.
** Also affects: haproxy (Ubuntu Focal)
Importance: Undecided
Status: New
** Changed in: haproxy (Ubuntu)
Status: New => Fix Released
** Changed in: haproxy (Ubuntu Focal)
Status: New => Triaged
--
You received this bug notification because you are a member of Ubuntu
Sponsors Team, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1894879
Title:
frequent crashes when using do-resolve()
Status in haproxy package in Ubuntu:
Fix Released
Status in haproxy source package in Focal:
Triaged
Bug description:
[Impact]
* using the do-resolve action causes frequent crashes (>15 times/day on
a given machine) and sometimes puts haproxy into a bogus state where DNS
resolution stops working forever without crashing (DoS)
* to address the problem, three upstream fixes released in a later micro
version are backported. Those fixes address problems affecting the
do-resolve action, which was introduced in the 2.0 branch (which is why
only Focal is affected)
[Test case]
Those crashes are due to the do-resolve action not being thread-safe. As
such, the bug is hard to trigger on demand, except under production load.
The proposed fix should be tested for any regression, ideally on an
SMP-enabled (multiple CPUs/cores) machine, as the crashes happen more
easily on those.
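Since the crashes stem from a thread-safety issue, it may be worth
confirming first that the build under test actually runs threaded and that
the machine has several cores, e.g. (haproxy prints its build options with
-vv):

haproxy -vv | grep -i thread
nproc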
haproxy should be configured as in your production environment and put
under load to see whether it still behaves as it should. The more
concurrent traffic the better, especially if the configuration makes use
of the do-resolve action.
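As a rough sketch of generating concurrent TCP traffic (the hostname and
port below are placeholders for your actual frontend, and netcat option
syntax may differ slightly between BSD and GNU variants):

for i in $(seq 1 200); do
    (printf 'hello\r\n' | nc -w 1 haproxy.example.com 8080) &
done
wait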
Any crash will be visible in the journal, so keep an eye on
"journalctl -fu haproxy.service" during the tests. Abnormal exits should
be visible with:
journalctl -u haproxy --grep ALERT | grep -F 'exited with code '
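A crash shows up there as an ALERT line like the ones quoted in the
original description below, for example:

Sep 07 18:14:57 foo haproxy[16831]: [ALERT] 250/181457 (16831) : Current worker #1 (16839) exited with code 139 (Segmentation fault)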
With the proposed fix, there should be none even when under heavy
concurrent traffic.
[Regression Potential]
The backported fixes address locking and memory management issues, so it's
possible they could introduce new locking problems or memory leaks. That
said, they only touch code related to the DNS/do-resolve action, which is
already known to trigger frequent crashes.
The backported fixes come from a later micro version (2.0.16, released on
2020-07-17), so they have already seen real-world production usage.
Upstream has released another micro version since then and did not revert
or rework the DNS/do-resolve action.
Furthermore, the patch was tested on a production machine that used to
crash >15 times per day. 48 hours after deploying the patch, we have seen
no crash and no sign of regression either.
[Original description]
haproxy 2.0.13-2 keeps crashing for us:
# journalctl --since yesterday -u haproxy --grep ALERT | grep -F 'exited with code '
Sep 07 18:14:57 foo haproxy[16831]: [ALERT] 250/181457 (16831) : Current worker #1 (16839) exited with code 139 (Segmentation fault)
Sep 07 19:45:23 foo haproxy[16876]: [ALERT] 250/194523 (16876) : Current worker #1 (16877) exited with code 139 (Segmentation fault)
Sep 07 19:49:01 foo haproxy[16916]: [ALERT] 250/194901 (16916) : Current worker #1 (16919) exited with code 139 (Segmentation fault)
Sep 07 19:49:02 foo haproxy[16939]: [ALERT] 250/194902 (16939) : Current worker #1 (16942) exited with code 139 (Segmentation fault)
Sep 07 19:49:03 foo haproxy[16953]: [ALERT] 250/194903 (16953) : Current worker #1 (16955) exited with code 139 (Segmentation fault)
Sep 07 19:49:37 foo haproxy[16964]: [ALERT] 250/194937 (16964) : Current worker #1 (16965) exited with code 139 (Segmentation fault)
Sep 07 23:41:13 foo haproxy[16982]: [ALERT] 250/234113 (16982) : Current worker #1 (16984) exited with code 139 (Segmentation fault)
Sep 07 23:41:14 foo haproxy[17076]: [ALERT] 250/234114 (17076) : Current worker #1 (17077) exited with code 139 (Segmentation fault)
Sep 07 23:43:20 foo haproxy[17090]: [ALERT] 250/234320 (17090) : Current worker #1 (17093) exited with code 139 (Segmentation fault)
Sep 07 23:43:50 foo haproxy[17113]: [ALERT] 250/234350 (17113) : Current worker #1 (17116) exited with code 139 (Segmentation fault)
Sep 07 23:43:51 foo haproxy[17134]: [ALERT] 250/234351 (17134) : Current worker #1 (17135) exited with code 139 (Segmentation fault)
Sep 07 23:44:44 foo haproxy[17146]: [ALERT] 250/234444 (17146) : Current worker #1 (17147) exited with code 139 (Segmentation fault)
Sep 08 00:18:54 foo haproxy[17164]: [ALERT] 251/001854 (17164) : Current worker #1 (17166) exited with code 134 (Aborted)
Sep 08 00:27:51 foo haproxy[17263]: [ALERT] 251/002751 (17263) : Current worker #1 (17266) exited with code 139 (Segmentation fault)
Sep 08 00:30:36 foo haproxy[17286]: [ALERT] 251/003036 (17286) : Current worker #1 (17289) exited with code 134 (Aborted)
Sep 08 00:37:01 foo haproxy[17307]: [ALERT] 251/003701 (17307) : Current worker #1 (17310) exited with code 139 (Segmentation fault)
Sep 08 00:40:31 foo haproxy[17331]: [ALERT] 251/004031 (17331) : Current worker #1 (17334) exited with code 139 (Segmentation fault)
Sep 08 00:41:14 foo haproxy[17650]: [ALERT] 251/004114 (17650) : Current worker #1 (17651) exited with code 139 (Segmentation fault)
Sep 08 00:41:59 foo haproxy[17669]: [ALERT] 251/004159 (17669) : Current worker #1 (17672) exited with code 139 (Segmentation fault)
The server in question uses a config that looks like this:
global
    maxconn 50000
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    maxconn 50000
    log global
    mode tcp
    option tcplog
    # dontlognull won't log sessions where the DNS resolution failed
    #option dontlognull
    timeout connect 5s
    timeout client 15s
    timeout server 15s

resolvers mydns
    nameserver local 127.0.0.1:53
    accepted_payload_size 1400
    timeout resolve 1s
    timeout retry 1s
    hold other 30s
    hold refused 30s
    hold nx 30s
    hold timeout 30s
    hold valid 10s
    hold obsolete 30s

frontend foo
    ...
    # dns lookup
    tcp-request content do-resolve(txn.remote_ip,mydns,ipv4) var(txn.remote_host)
    tcp-request content capture var(txn.remote_ip) len 40
    # XXX: the remaining rejections happen in the backend to provide better logging
    # reject connection on DNS resolution error
    use_backend be_dns_failed unless { var(txn.remote_ip) -m found }
    ...
    # at this point, we should let the connection through
    default_backend be_allowed
When reaching out to haproxy folks in #haproxy,
https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=ef131ae was
mentioned as a potential fix.
https://www.haproxy.org/bugs/bugs-2.0.13.html shows 3 commits with
"dns":
* https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=ef131ae
* https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=39eb766
* https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=74d704f
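These commits can also be inspected locally; assuming a clone URL matching
the gitweb instance above, something like:

git clone http://git.haproxy.org/git/haproxy-2.0.git
cd haproxy-2.0
git show ef131ae 39eb766 74d704f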
It would be great to have those (at least ef131ae) SRU'ed to Focal
(Bionic isn't affected as it runs 1.8).
Additional information:
# lsb_release -rd
Description:    Ubuntu 20.04.1 LTS
Release:        20.04

# apt-cache policy haproxy
haproxy:
  Installed: 2.0.13-2
  Candidate: 2.0.13-2
  Version table:
 *** 2.0.13-2 500
        500 http://archive.ubuntu.com/ubuntu focal/main amd64 Packages
        100 /var/lib/dpkg/status
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/haproxy/+bug/1894879/+subscriptions