[SRU][Xenial][PATCH 1/1] ibmveth: Support to enable LSO/CSO for Trunk VEA.
Daniel Axtens
daniel.axtens at canonical.com
Fri Nov 17 13:28:18 UTC 2017
Hi Joe,
Thanks - subscribed and commenting now!
Regards,
Daniel
On Sat, Nov 18, 2017 at 12:23 AM, Joseph Salisbury <
joseph.salisbury at canonical.com> wrote:
> On 08/25/2017 12:42 PM, Daniel Axtens wrote:
> > Hi,
> >
> > Sorry if I'm a little late to the party with this...
> >
> > In support we've seen a case where ibmveth + openvswitch + bnx2x has
> > led to some issues, which you should probably be aware of before
> > turning on large segments in more places.
> >
> > Here's my summary from that support case, appropriately sanitised.
> >
> > ==========
> >
> > [Issue: we see a firmware assertion from an IBM branded bnx2x card.
> > Decoding the dump with the help of upstream shows that the assert is
> > caused by a packet with GSO on and gso_size > ~9700 bytes being passed
> > to the card. We traced the packets through the system, and came up
> > with this root cause. The system uses ibmveth to talk to AIX LPARs, a
> > bnx2x network card to talk to the world, and Open vSwitch to tie them
> > together. There is no VIOS involvement - the card is attached to the
> > Linux partition.]
> >
> > The packets causing the issue come through the ibmveth interface -
> > from the AIX LPAR. The veth protocol is 'special' - communication
> > between LPARs on the same chassis can use very large (64k) frames to
> > reduce overhead. Normal networks cannot handle such large packets, so
> > traditionally, the VIOS partition would signal to the AIX partitions
> > that it was 'special', and AIX would send regular, ethernet-sized
> > packets to VIOS, which VIOS would then send out.
> >
> > This signalling between VIOS and AIX is done in a way that is not
> > standards-compliant, and so was never made part of Linux. Instead, the
> > Linux driver has always understood large frames and passed them up the
> > network stack.
> >
> > In some cases (e.g. with TCP), multiple TCP segments are coalesced
> > into one large packet. In Linux, this goes through the generic receive
> > offload code, using a similar mechanism to GSO. These segments can be
> > very large, which presents as a very large MSS (maximum segment size)
> > or gso_size.
> >
> > Normally, the large packet is simply passed to whatever network
> > application on Linux is going to consume it, and everything is OK.
> >
> > However, in this case, the packets go through Open vSwitch, and are
> > then passed to the bnx2x driver. The bnx2x driver/hardware supports
> > TSO and GSO, but with a restriction: the maximum segment size is
> > limited to around 9700 bytes. Normally this is more than adequate as
> > jumbo frames are limited to 9000 bytes. However, if a large packet
> > with large (>9700 byte) TCP segments arrives through ibmveth, and is
> > passed to bnx2x, the hardware will panic.
> >
> > Turning off TSO prevents the crash as the kernel resegments the data
> > and assembles the packets in software. This has a performance cost.
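
(In case it is useful before a proper fix lands: the interim mitigation is
just the standard ethtool toggle on the bnx2x interface - the interface name
below is only a placeholder:

    ethtool -K eth0 tso off

after which the kernel segments in software, with the performance cost noted
above.)
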
> >
> > Clearly at the very least, bnx2x should not crash in this case, and I
> > am working towards a patch for that.
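
To sketch the direction (illustrative only, not the eventual patch - the
exact limit and which feature bits to drop would need checking against the
hardware documentation), something along the lines of a gso_size check in
the driver's ndo_features_check hook would let the stack fall back to
software segmentation for these packets rather than handing them to the
firmware:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Illustrative sketch: refuse hardware GSO for packets whose gso_size
 * exceeds what the hardware tolerates (~9700 bytes per the decoded dump),
 * so validate_xmit_skb() resegments them in software instead.
 */
static netdev_features_t
example_features_check(struct sk_buff *skb, struct net_device *dev,
                       netdev_features_t features)
{
        if (skb_is_gso(skb) && skb_shinfo(skb)->gso_size > 9700)
                features &= ~NETIF_F_GSO_MASK;

        return features;
}

That avoids the firmware assert but, as discussed below, does not solve the
MSS signalling problem with the AIX LPAR.
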
> >
> > However, this still leaves us with some issues. The only thing the
> > bnx2x driver can sensibly do is drop the packet, which will prevent
> > the crash. However, there will still be issues with large packets:
> > when they are dropped, the other side will eventually realise that the
> > data is missing and ask for a retransmit, but the retransmit might
> > also be too big - there's no way of signalling back to the AIX LPAR
> > that it should reduce the MSS. Even if the data eventually gets
> > through there will be a latency/throughput/performance hit.
> >
> > ==========
> >
> > Seeing as IBM seems to be actively developing in this area (indeed,
> > this code explicitly deals with ibmveth + OVS), could we gently push
> > for this issue to be considered and fixed as well?
> >
> > I'm happy to engage on Launchpad or whatever the most appropriate
> > channel is.
> >
> > Regards,
> > Daniel
> >
> > On Sat, Aug 26, 2017 at 2:23 AM, Joseph Salisbury
> > <joseph.salisbury at canonical.com> wrote:
> >> From: Sivakumar Krishnasamy <ksiva at linux.vnet.ibm.com>
> >>
> >> BugLink: http://bugs.launchpad.net/bugs/1692538
> >>
> >> The current largesend and checksum offload feature in the ibmveth driver
> >> works as follows:
> >> - Source VM sends the TCP packets with ip_summed field set as
> >>   CHECKSUM_PARTIAL and TCP pseudo header checksum is placed in
> >>   checksum field
> >> - CHECKSUM_PARTIAL flag in SKB will enable ibmveth driver to mark
> >>   "no checksum" and "checksum good" bits in transmit buffer descriptor
> >>   before the packet is delivered to pseries PowerVM Hypervisor
> >> - If ibmveth has largesend capability enabled, transmit buffer descriptors
> >>   are marked accordingly before the packet is delivered to the Hypervisor
> >>   (along with the mss value for packets with length > MSS)
> >> - Destination VM's ibmveth driver receives the packet with "checksum good"
> >>   bit set and so, SKB's ip_summed field is set with CHECKSUM_UNNECESSARY
> >> - If the "largesend" bit was on, the mss value is copied from the receive
> >>   descriptor into SKB's gso_size and other flags are appropriately set for
> >>   packets > MSS size
> >> - The packet is now successfully delivered up the stack in the destination VM
> >>
> >> The offloads described above work fine for TCP communication among VMs in
> >> the same pseries server (VM A <=> PowerVM Hypervisor <=> VM B).
> >>
> >> We are now enabling support for OVS in the pseries PowerVM environment. One
> >> of our requirements is to have the ibmveth driver configured in "Trunk" mode
> >> when it is used with OVS. This is because the PowerVM Hypervisor will no
> >> longer bridge the packets between VMs; instead the packets are delivered to
> >> the IO Server, which hosts OVS to bridge them between VMs or to external
> >> networks (flow shown below):
> >>
> >> VM A <=> PowerVM Hypervisor <=> IO Server(OVS) <=> PowerVM Hypervisor <=> VM B
> >>
> >> In the "IO server" the packet is received by the inbound Trunk ibmveth and
> >> then delivered to OVS, which then bridges it to the outbound Trunk ibmveth
> >> (shown below):
> >>
> >> Inbound Trunk ibmveth <=> OVS <=> Outbound Trunk ibmveth
> >>
> >> In this model, we hit the following issues which impacted the VM
> >> communication performance:
> >>
> >> - Issue 1: ibmveth doesn't support largesend and checksum offload features
> >>   when configured as "Trunk". The driver has explicit checks to prevent
> >>   enabling these offloads.
> >>
> >> - Issue 2: SYN packet drops are seen at the destination VM. When the packet
> >>   originates, it has the CHECKSUM_PARTIAL flag set, and as it gets delivered
> >>   to the IO server's inbound Trunk ibmveth, on validating the "checksum good"
> >>   bits in the ibmveth receive routine, the SKB's ip_summed field is set with
> >>   the CHECKSUM_UNNECESSARY flag. This packet is then bridged by OVS (or Linux
> >>   Bridge) and delivered to the outbound Trunk ibmveth. At this point the
> >>   outbound ibmveth transmit routine will not set the "no checksum" and
> >>   "checksum good" bits in the transmit buffer descriptor, as it does so only
> >>   when the ip_summed field is CHECKSUM_PARTIAL. When this packet gets
> >>   delivered to the destination VM, the TCP layer receives the packet with a
> >>   checksum value of 0 and with no checksum-related flags in the ip_summed
> >>   field. This leads to packet drops, so TCP connections never go through.
> >>
> >> - Issue 3: The first packet of a TCP connection will be dropped if there is
> >>   no OVS flow cached in the datapath. While trying to identify the flow, OVS
> >>   computes the checksum. The computed checksum will be invalid at the
> >>   receiving end, as the ibmveth transmit routine zeroes out the pseudo
> >>   checksum value in the packet. This leads to a packet drop.
> >>
> >> - Issue 4: The ibmveth driver doesn't have support for SKBs with a frag_list.
> >>   When the physical NIC has GRO enabled and OVS bridges these packets, the
> >>   OVS vport send code will end up calling dev_queue_xmit, which in turn
> >>   calls validate_xmit_skb.
> >>   In the validate_xmit_skb routine, the larger packets will get segmented into
> >>   MSS-sized segments if the SKB has a frag_list and the driver to which
> >>   they are delivered doesn't support the NETIF_F_FRAGLIST feature.
> >>
> >> This patch addresses the above four issues, thereby enabling end-to-end
> >> largesend and checksum offload support for better performance.
> >>
> >> - Fix for Issue 1: Remove the checks which prevent enabling TCP largesend and
> >>   checksum offloads.
> >> - Fix for Issue 2: When ibmveth receives a packet with the "checksum good"
> >>   bit set and it is configured in Trunk mode, set the appropriate SKB fields
> >>   using skb_partial_csum_set (the ip_summed field is set with
> >>   CHECKSUM_PARTIAL).
> >> - Fix for Issue 3: Recompute the pseudo header checksum before sending the
> >>   SKB up the stack.
> >> - Fix for Issue 4: Linearize SKBs with a frag_list. Though we end up
> >>   allocating buffers and copying data, this fix gives up to a 4X throughput
> >>   increase.
> >>
> >> Note: All these fixes need to be applied together, as fixing just one of
> >> them will lead to other issues immediately (especially for Issues 1, 2 & 3).
> >>
> >> Signed-off-by: Sivakumar Krishnasamy <ksiva at linux.vnet.ibm.com>
> >> Signed-off-by: David S. Miller <davem at davemloft.net>
> >> (backported from commit 66aa0678efc29abd2ab02a09b23f9a8bc9f12a6c)
> >> Signed-off-by: Joseph Salisbury <joseph.salisbury at canonical.com>
> >> ---
> >> drivers/net/ethernet/ibm/ibmveth.c | 107 ++++++++++++++++++++++++++++++-------
> >> drivers/net/ethernet/ibm/ibmveth.h | 1 +
> >> 2 files changed, 90 insertions(+), 18 deletions(-)
> >>
> >> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
> >> index 2f9b12c..0482491 100644
> >> --- a/drivers/net/ethernet/ibm/ibmveth.c
> >> +++ b/drivers/net/ethernet/ibm/ibmveth.c
> >> @@ -46,6 +46,8 @@
> >> #include <asm/vio.h>
> >> #include <asm/iommu.h>
> >> #include <asm/firmware.h>
> >> +#include <net/tcp.h>
> >> +#include <net/ip6_checksum.h>
> >>
> >> #include "ibmveth.h"
> >>
> >> @@ -804,8 +806,7 @@ static int ibmveth_set_csum_offload(struct net_device *dev, u32 data)
> >>
> >> ret = h_illan_attributes(adapter->vdev->unit_address, 0, 0, &ret_attr);
> >>
> >> - if (ret == H_SUCCESS && !(ret_attr & IBMVETH_ILLAN_ACTIVE_TRUNK) &&
> >> - !(ret_attr & IBMVETH_ILLAN_TRUNK_PRI_MASK) &&
> >> + if (ret == H_SUCCESS &&
> >> (ret_attr & IBMVETH_ILLAN_PADDED_PKT_CSUM)) {
> >> ret4 = h_illan_attributes(adapter->vdev->unit_address, clr_attr,
> >> set_attr, &ret_attr);
> >> @@ -1035,6 +1036,15 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
> >> dma_addr_t dma_addr;
> >> unsigned long mss = 0;
> >>
> >> + /* veth doesn't handle frag_list, so linearize the skb.
> >> + * When GRO is enabled SKB's can have frag_list.
> >> + */
> >> + if (adapter->is_active_trunk &&
> >> + skb_has_frag_list(skb) && __skb_linearize(skb)) {
> >> + netdev->stats.tx_dropped++;
> >> + goto out;
> >> + }
> >> +
> >> /*
> >> * veth handles a maximum of 6 segments including the header, so
> >> * we have to linearize the skb if there are more than this.
> >> @@ -1059,9 +1069,6 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
> >>
> >> desc_flags = IBMVETH_BUF_VALID;
> >>
> >> - if (skb_is_gso(skb) && adapter->fw_large_send_support)
> >> - desc_flags |= IBMVETH_BUF_LRG_SND;
> >> -
> >> if (skb->ip_summed == CHECKSUM_PARTIAL) {
> >> unsigned char *buf = skb_transport_header(skb) +
> >> skb->csum_offset;
> >> @@ -1071,6 +1078,9 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
> >> /* Need to zero out the checksum */
> >> buf[0] = 0;
> >> buf[1] = 0;
> >> +
> >> + if (skb_is_gso(skb) && adapter->fw_large_send_support)
> >> + desc_flags |= IBMVETH_BUF_LRG_SND;
> >> }
> >>
> >> retry_bounce:
> >> @@ -1123,7 +1133,7 @@ retry_bounce:
> >> descs[i+1].fields.address = dma_addr;
> >> }
> >>
> >> - if (skb_is_gso(skb)) {
> >> + if (skb->ip_summed == CHECKSUM_PARTIAL && skb_is_gso(skb)) {
> >> if (adapter->fw_large_send_support) {
> >> mss = (unsigned long)skb_shinfo(skb)->gso_size;
> >> adapter->tx_large_packets++;
> >> @@ -1224,6 +1234,71 @@ static void ibmveth_rx_mss_helper(struct sk_buff *skb, u16 mss, int lrg_pkt)
> >> }
> >> }
> >>
> >> +static void ibmveth_rx_csum_helper(struct sk_buff *skb,
> >> + struct ibmveth_adapter *adapter)
> >> +{
> >> + struct iphdr *iph = NULL;
> >> + struct ipv6hdr *iph6 = NULL;
> >> + __be16 skb_proto = 0;
> >> + u16 iphlen = 0;
> >> + u16 iph_proto = 0;
> >> + u16 tcphdrlen = 0;
> >> +
> >> + skb_proto = be16_to_cpu(skb->protocol);
> >> +
> >> + if (skb_proto == ETH_P_IP) {
> >> + iph = (struct iphdr *)skb->data;
> >> +
> >> + /* If the IP checksum is not offloaded and if the packet
> >> + * is large send, the checksum must be rebuilt.
> >> + */
> >> + if (iph->check == 0xffff) {
> >> + iph->check = 0;
> >> + iph->check = ip_fast_csum((unsigned char *)iph,
> >> + iph->ihl);
> >> + }
> >> +
> >> + iphlen = iph->ihl * 4;
> >> + iph_proto = iph->protocol;
> >> + } else if (skb_proto == ETH_P_IPV6) {
> >> + iph6 = (struct ipv6hdr *)skb->data;
> >> + iphlen = sizeof(struct ipv6hdr);
> >> + iph_proto = iph6->nexthdr;
> >> + }
> >> +
> >> + /* In OVS environment, when a flow is not cached, specifically for a
> >> + * new TCP connection, the first packet information is passed up
> >> + * the user space for finding a flow. During this process, OVS computes
> >> + * checksum on the first packet when CHECKSUM_PARTIAL flag is set.
> >> + *
> >> + * Given that we zeroed out TCP checksum field in transmit path
> >> + * (refer ibmveth_start_xmit routine) as we set "no checksum bit",
> >> + * OVS computed checksum will be incorrect w/o TCP pseudo checksum
> >> + * in the packet. This leads to OVS dropping the packet and hence
> >> + * TCP retransmissions are seen.
> >> + *
> >> + * So, re-compute TCP pseudo header checksum.
> >> + */
> >> + if (iph_proto == IPPROTO_TCP && adapter->is_active_trunk) {
> >> + struct tcphdr *tcph = (struct tcphdr *)(skb->data + iphlen);
> >> +
> >> + tcphdrlen = skb->len - iphlen;
> >> +
> >> + /* Recompute TCP pseudo header checksum */
> >> + if (skb_proto == ETH_P_IP)
> >> + tcph->check = ~csum_tcpudp_magic(iph->saddr,
> >> + iph->daddr, tcphdrlen, iph_proto, 0);
> >> + else if (skb_proto == ETH_P_IPV6)
> >> + tcph->check = ~csum_ipv6_magic(&iph6->saddr,
> >> + &iph6->daddr, tcphdrlen, iph_proto, 0);
> >> +
> >> + /* Setup SKB fields for checksum offload */
> >> + skb_partial_csum_set(skb, iphlen,
> >> + offsetof(struct tcphdr, check));
> >> + skb_reset_network_header(skb);
> >> + }
> >> +}
> >> +
> >> static int ibmveth_poll(struct napi_struct *napi, int budget)
> >> {
> >> struct ibmveth_adapter *adapter =
> >> @@ -1231,7 +1306,6 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
> >> struct net_device *netdev = adapter->netdev;
> >> int frames_processed = 0;
> >> unsigned long lpar_rc;
> >> - struct iphdr *iph;
> >> u16 mss = 0;
> >>
> >> restart_poll:
> >> @@ -1289,17 +1363,7 @@ restart_poll:
> >>
> >> if (csum_good) {
> >> skb->ip_summed = CHECKSUM_UNNECESSARY;
> >> - if (be16_to_cpu(skb->protocol) == ETH_P_IP) {
> >> - iph = (struct iphdr *)skb->data;
> >> -
> >> - /* If the IP checksum is not offloaded and if the packet
> >> - * is large send, the checksum must be rebuilt.
> >> - */
> >> - if (iph->check == 0xffff) {
> >> - iph->check = 0;
> >> - iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
> >> - }
> >> - }
> >> + ibmveth_rx_csum_helper(skb, adapter);
> >> }
> >>
> >> if (length > netdev->mtu + ETH_HLEN) {
> >> @@ -1621,6 +1685,13 @@ static int ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
> >> netdev->hw_features |= NETIF_F_TSO;
> >> }
> >>
> >> + adapter->is_active_trunk = false;
> >> + if (ret == H_SUCCESS && (ret_attr & IBMVETH_ILLAN_ACTIVE_TRUNK)) {
> >> + adapter->is_active_trunk = true;
> >> + netdev->hw_features |= NETIF_F_FRAGLIST;
> >> + netdev->features |= NETIF_F_FRAGLIST;
> >> + }
> >> +
> >> memcpy(netdev->dev_addr, mac_addr_p, ETH_ALEN);
> >>
> >> if (firmware_has_feature(FW_FEATURE_CMO))
> >> diff --git a/drivers/net/ethernet/ibm/ibmveth.h b/drivers/net/ethernet/ibm/ibmveth.h
> >> index 7acda04..de6e381 100644
> >> --- a/drivers/net/ethernet/ibm/ibmveth.h
> >> +++ b/drivers/net/ethernet/ibm/ibmveth.h
> >> @@ -157,6 +157,7 @@ struct ibmveth_adapter {
> >> int pool_config;
> >> int rx_csum;
> >> int large_send;
> >> + bool is_active_trunk;
> >> void *bounce_buffer;
> >> dma_addr_t bounce_buffer_dma;
> >>
> >> --
> >> 2.7.4
> >>
> >>
> >> --
> >> kernel-team mailing list
> >> kernel-team at lists.ubuntu.com
> >> https://lists.ubuntu.com/mailman/listinfo/kernel-team
> Hi Daniel,
>
> The SRU request for Xenial is on hold. I forwarded the information you
> provided to IBM. Thanks for the detailed write-up! I'll keep you up to
> date on any feedback they provide.
>
> You could also subscribe to the bug report, and feel free to comment as
> needed:
> http://pad.lv/1692538
>
> Thanks,
>
> Joe
>