Recommended consultants for TCP message loss issue?
Rich K
rk5devmail at gmail.com
Wed Apr 13 13:15:53 UTC 2022
We have a strange data loss issue whose root cause we have so far been
unable to find.
The main point of this question is to get recommendations on consultants we
can hire to help us troubleshoot. We're willing to bring in paid experts
here, but we're not sure which company to even ask.
Problem details in case anyone has any suggestions:
We have an application that sends JSON messages over TCP to an Ubuntu 18.04
box that serves as a place to aggregate logs. We use syslog-ng on the system
to collect the logs and then forward them downstream to another system. We've
deployed the same setup at multiple customer sites, but at this particular
site we are missing messages after they enter the server. How do we know?
- Each message has a monotonically increasing sequence number. As a simple
example, we can see that we receive messages 1,2,3 and then miss 4,5,6, and
then get 7,8,9. Average message size is 1 KB; messages are occasionally
10 KB, but that size is rare.
- Using tcpdump we can also see that ALL the messages come into the
server, which we believe rules out packet loss (the retransmission rate is
less than 0.1%). As best we can tell, 100% of the messages reach the NIC.
- Thinking this was an issue with syslog-ng, we replaced it with fluentd,
but still experienced message loss (we also tried a range of additional
syslog-ng tuning, such as flow control).
- The TCP buffer is set to 40 gigabits, which should be more than enough for
our traffic volume.
- There are no obvious errors that we can see, but we're happy to check other
logs we may have missed if anyone has recommendations.
- Resource usage seems normal and the server does not appear stressed.
- This is a VM running on ESXi version ESXi6.5u2-10719125_20190308.
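
For anyone wanting to reproduce the sequence-number check described in the first bullet, here is a minimal sketch in Python. It assumes the messages are available as one JSON object per line and that the sequence number lives in a field called "seq"; both the field name and the function names are hypothetical, not taken from the actual application.

```python
import json
from typing import Iterable, List, Tuple


def find_gaps(seqs: Iterable[int]) -> List[Tuple[int, int]]:
    """Return inclusive (first_missing, last_missing) ranges.

    Assumes the sequence numbers are expected to increase by exactly 1.
    Duplicates and late (out-of-order) arrivals are ignored, so a message
    that shows up late is still reported as part of a gap.
    """
    gaps: List[Tuple[int, int]] = []
    prev = None
    for seq in seqs:
        if prev is not None and seq > prev + 1:
            gaps.append((prev + 1, seq - 1))
        if prev is None or seq > prev:
            prev = seq
    return gaps


def seqs_from_json_lines(lines: Iterable[str], field: str = "seq") -> List[int]:
    """Pull the sequence number out of each JSON line.

    The field name "seq" is a placeholder for whatever the real messages use.
    """
    out: List[int] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        out.append(int(json.loads(line)[field]))
    return out


if __name__ == "__main__":
    # Mirrors the example in the post: 1,2,3 arrive, 4,5,6 are missing, 7,8,9 arrive.
    received = ['{"seq": %d}' % n for n in (1, 2, 3, 7, 8, 9)]
    print(find_gaps(seqs_from_json_lines(received)))
```

Running the same check at each stage (the sender's log, a decode of the tcpdump capture, and the collector's output) would show at which hop the gap first appears.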
netstat -s output:
Ip:
Forwarding: 1
4593164861 total packets received
4142886767 forwarded
0 incoming packets discarded
300256068 incoming packets delivered
4592170310 requests sent out
6968 dropped because of missing route
157 fragments failed
Icmp:
14625192 ICMP messages received
14392207 input ICMP message failed
echo replies: 93936
timestamp request: 18
address mask request: 9
14500776 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 14401103
echo requests: 99529
echo replies: 126
timestamp replies: 18
IcmpMsg:
InType0: 93936
InType3: 14411285
InType8: 126
InType11: 119818
InType13: 18
InType17: 9
OutType0: 126
OutType3: 14401103
OutType8: 99529
OutType14: 18
Tcp:
49655 active connection openings
922 passive connection openings
25581 failed connection attempts
24 connection resets received
4 connections established
255130475 segments received
418782284 segments sent out
297424 segments retransmitted
84 bad segments received
44527 resets sent
Udp:
9568375 packets received
771038 packets to unknown port received
0 packet receive errors
30315599 packets sent
0 receive buffer errors
0 send buffer errors
IgnoredMulti: 20160522
UdpLite:
TcpExt:
160 resets received for embryonic SYN_RECV sockets
43 ICMP packets dropped because they were out-of-window
160 TCP sockets finished time wait in fast timer
83 packetes rejected in established connections because of timestamp
10465104 delayed acks sent
14675 delayed acks further delayed because of locked socket
Quick ack mode was activated 7761 times
30653113 packet headers predicted
1 congestion windows fully recovered without slow start
1 congestion windows partially recovered using Hoe heuristic
TCPDSACKUndo: 1053
16215 congestion windows recovered without slow start after partial ack
TCPLostRetransmit: 218
TCPSackFailures: 2562
62 timeouts in loss state
12712 fast retransmits
805 retransmits in slow start
TCPTimeouts: 259506
TCPLossProbes: 51861
TCPLossProbeRecovery: 175
TCPSackRecoveryFail: 72
TCPDSACKOldSent: 7762
TCPDSACKOfoSent: 150
TCPDSACKRecv: 67553
TCPDSACKOfoRecv: 122
20 connections reset due to unexpected data
10 connections reset due to early user close
25203 connections aborted due to timeout
TCPDSACKIgnoredOld: 875
TCPDSACKIgnoredNoUndo: 59255
TCPSpuriousRTOs: 180
TCPSackShifted: 2918
TCPSackMerged: 28194
TCPSackShiftFallback: 20170
IPReversePathFilter: 28269
TCPRcvCoalesce: 919433
TCPOFOQueue: 94994
TCPOFOMerge: 150
TCPChallengeACK: 81
TCPSYNChallenge: 85
TCPSpuriousRtxHostQueues: 23
TCPAutoCorking: 19246152
TCPFromZeroWindowAdv: 64
TCPToZeroWindowAdv: 64
TCPWantZeroWindowAdv: 155
TCPSynRetrans: 214009
TCPOrigDataSent: 393676699
TCPHystartTrainDetect: 4
TCPHystartTrainCwnd: 107
TCPHystartDelayDetect: 1450
TCPHystartDelayCwnd: 29424
TCPACKSkippedPAWS: 59
TCPACKSkippedSeq: 73
TCPACKSkippedChallenge: 4
TCPWinProbe: 437
TCPKeepAlive: 474
IpExt:
InMcastPkts: 343
OutMcastPkts: 19138
InBcastPkts: 20160522
InOctets: 2691133820485
OutOctets: 5716198198735
InMcastOctets: 12348
OutMcastOctets: 2334836
InBcastOctets: 1672743338
InNoECTPkts: 7175709883
InECT0Pkts: 90382743
Sctp:
0 Current Associations
0 Active Associations
0 Passive Associations
0 Number of Aborteds
0 Number of Graceful Terminations
0 Number of Out of Blue packets
0 Number of Packets with invalid Checksum
0 Number of control chunks sent
0 Number of ordered chunks sent
0 Number of Unordered chunks sent
0 Number of control chunks received
0 Number of ordered chunks received
0 Number of Unordered chunks received
0 Number of messages fragmented
0 Number of messages reassembled
0 Number of SCTP packets sent
0 Number of SCTP packets received
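
The retransmission rate quoted earlier can be cross-checked against the Tcp counters in this output. The following is a minimal sketch in Python, assuming the rate is defined as "segments retransmitted" divided by "segments sent out"; the function names are hypothetical, and the sample text is copied from the Tcp section above.

```python
import re
from typing import Dict

# The Tcp counters used here, copied from the netstat -s output above.
SAMPLE = """\
Tcp:
255130475 segments received
418782284 segments sent out
297424 segments retransmitted
84 bad segments received
"""


def parse_segment_counters(netstat_text: str) -> Dict[str, int]:
    """Collect lines of the form '<number> segments <something>'."""
    counters: Dict[str, int] = {}
    for line in netstat_text.splitlines():
        match = re.match(r"\s*(\d+)\s+(segments .+)$", line)
        if match:
            counters[match.group(2).strip()] = int(match.group(1))
    return counters


def retransmit_rate(counters: Dict[str, int]) -> float:
    """Fraction of sent segments that were retransmissions."""
    return counters["segments retransmitted"] / counters["segments sent out"]


if __name__ == "__main__":
    rate = retransmit_rate(parse_segment_counters(SAMPLE))
    print(f"retransmit rate: {rate:.4%}")
```

With the numbers above this comes to roughly 0.07%, consistent with the "less than 0.1%" figure. Note that these counters are host-wide totals since boot, and that they count segments this host sent and retransmitted, so they describe the outbound direction rather than loss on the inbound path.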