Cmt: [SRU][B/aws, F/aws, G/aws][PATCH 0/1] aws: fix network performance regression due to initial TCP buffer size change

Andrea Righi andrea.righi at canonical.com
Tue Jan 5 09:10:11 UTC 2021


On Tue, Jan 05, 2021 at 08:53:51AM +0000, Colin Ian King wrote:
> On 05/01/2021 07:44, Andrea Righi wrote:
> > BugLink: https://bugs.launchpad.net/bugs/1910200
> > 
> > [Impact]
> > 
> > AWS has seen some customers reporting networking performance degradation
> > after they upgraded their Ubuntu instances. This regression heavily
> > impacts customers who are using MTU=9000 (which is the default in
> > EC2).
> > 
> > [Test case]
> > 
> > Bug reproduced internally in AWS (no test case provided), but apparently
> > it is very easy to reproduce simply by measuring networking performance.
> > 
> > [Fix]
> > 
> > AWS worked internally and found that this regression has been introduced
> > by:
> > 
> >  a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
> > 
> > To solve this problem we need to apply the following upstream commit
> > that explicitly fixes the problem introduced by the commit above:
> > 
> >  33ae7b5bb841 ("tcp: select sane initial rcvq_space.space for big MSS")
> > 
> > [Regression potential]
> > 
> > This is an upstream fix that only affects the initial TCP buffer space
> > and allows the TCP window size to be dynamically increased, basically
> > restoring the previous (correct) behavior, so regression potential is
> > minimal.
> > 
> > 
> I can't seem to find the upstream fix from the sha 33ae7b5bb841, is that
> a typo?  Also, does it make sense to apply this for non-AWS kernels too?

You're right! The commit sha is wrong; the correct one is this:

  72d05c00d7ec ("tcp: select sane initial rcvq_space.space for big MSS")

I think I used the sha after applying the commit. :)

And I think it makes sense to apply it to non-AWS kernels as well:
potentially all users of jumbo frames are affected by this regression.
It's definitely more urgent to apply it to the AWS kernel, though,
because it's directly affecting some users and the performance
regression is significant.

Moreover, it's worth mentioning that there is also a user-space
workaround for those who are affected: increasing the TCP rmem size via
/proc/sys/net/ipv4/tcp_rmem (I forgot to mention this in the
description above). So personally I don't see this as a super critical
fix to apply right now; all the kernels will naturally receive it via
the regular SRU updates.
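
For reference, here is a minimal sketch of how the workaround could be
inspected from user space. /proc/sys/net/ipv4/tcp_rmem holds three
values in bytes (min, default, max). The helper names, the choice to
raise the default (middle) field, and the 256KB target are all
assumptions for illustration only; the right value depends on the
workload and was not specified above.

```python
from pathlib import Path

# Kernel interface mentioned above; holds "min default max" in bytes.
TCP_RMEM_PATH = Path("/proc/sys/net/ipv4/tcp_rmem")


def parse_tcp_rmem(text):
    """Parse the three whitespace-separated tcp_rmem fields into ints."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {text!r}")
    return tuple(int(f) for f in fields)


def with_larger_default(values, new_default):
    """Return (min, default, max) with the default raised to new_default.

    The default is never lowered, and max is bumped if needed so that
    default <= max still holds.
    """
    lo, default, hi = values
    default = max(default, new_default)
    hi = max(hi, default)
    return (lo, default, hi)


def format_tcp_rmem(values):
    """Format the triple the way the sysctl file expects it."""
    return " ".join(str(v) for v in values)


if __name__ == "__main__" and TCP_RMEM_PATH.exists():
    current = parse_tcp_rmem(TCP_RMEM_PATH.read_text())
    # 256KB is a hypothetical target, not a recommendation.
    proposed = with_larger_default(current, 256 * 1024)
    print("current: ", format_tcp_rmem(current))
    print("proposed:", format_tcp_rmem(proposed))
    # Applying the change requires root and is deliberately left out:
    # TCP_RMEM_PATH.write_text(format_tcp_rmem(proposed) + "\n")
```

The script only prints the current and proposed settings. The write is
left commented out because it needs root and does not persist across
reboots.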

Thoughts?

-Andrea


