[SRU][Xenial][PATCH 1/1] ibmveth: Support to enable LSO/CSO for Trunk VEA.
Daniel Axtens
daniel.axtens at canonical.com
Fri Nov 17 13:28:18 UTC 2017
Hi Joe,
Thanks - subscribed and commenting now!
Regards,
Daniel
On Sat, Nov 18, 2017 at 12:23 AM, Joseph Salisbury <
joseph.salisbury at canonical.com> wrote:
> On 08/25/2017 12:42 PM, Daniel Axtens wrote:
> > Hi,
> >
> > Sorry if I'm a little late to the party with this...
> >
> > In support we've seen a case where ibmveth + openvswitch + bnx2x has
> > led to some issues, which you should probably be aware of before
> > turning on large segments in more places.
> >
> > Here's my summary from that support case, appropriately sanitised.
> >
> > ==========
> >
> > [Issue: we see a firmware assertion from an IBM branded bnx2x card.
> > Decoding the dump with the help of upstream shows that the assert is
> > caused by a packet with GSO on and gso_size > ~9700 bytes being passed
> > to the card. We traced the packets through the system, and came up
> > with this root cause. The system uses ibmveth to talk to AIX LPARs, a
> > bnx2x network card to talk to the world, and Open vSwitch to tie them
> > together. There is no VIOS involvement - the card is attached to the
> > Linux partition.]
> >
> > The packets causing the issue come through the ibmveth interface -
> > from the AIX LPAR. The veth protocol is 'special' - communication
> > between LPARs on the same chassis can use very large (64k) frames to
> > reduce overhead. Normal networks cannot handle such large packets, so
> > traditionally, the VIOS partition would signal to the AIX partitions
> > that it was 'special', and AIX would send regular, ethernet-sized
> > packets to VIOS, which VIOS would then send out.
> >
> > This signalling between VIOS and AIX is done in a way that is not
> > standards-compliant, and so was never made part of Linux. Instead, the
> > Linux driver has always understood large frames and passed them up the
> > network stack.
> >
> > In some cases (e.g. with TCP), multiple TCP segments are coalesced
> > into one large packet. In Linux, this goes through the generic receive
> > offload code, using a similar mechanism to GSO. These segments can be
> > very large, which presents as a very large MSS (maximum segment size)
> > or gso_size.
> >
> > Normally, the large packet is simply passed to whatever network
> > application on Linux is going to consume it, and everything is OK.
> >
> > However, in this case, the packets go through Open vSwitch, and are
> > then passed to the bnx2x driver. The bnx2x driver/hardware supports
> > TSO and GSO, but with a restriction: the maximum segment size is
> > limited to around 9700 bytes. Normally this is more than adequate as
> > jumbo frames are limited to 9000 bytes. However, if a large packet
> > with large (>9700 byte) TCP segments arrives through ibmveth, and is
> > passed to bnx2x, the hardware will panic.
> >
> > Turning off TSO prevents the crash as the kernel resegments the data
> > and assembles the packets in software. This has a performance cost.
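
(In case it is useful before a proper fix lands: the interim mitigation is
just the standard ethtool toggle on the bnx2x interface - the interface name
below is only a placeholder:

    ethtool -K eth0 tso off

after which the kernel segments in software, with the performance cost noted
above.)
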
> >
> > Clearly at the very least, bnx2x should not crash in this case, and I
> > am working towards a patch for that.
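
To sketch the direction (illustrative only, not the eventual patch - the
exact limit and which feature bits to drop would need checking against the
hardware documentation), something along the lines of a gso_size check in
the driver's ndo_features_check hook would let the stack fall back to
software segmentation for these packets rather than handing them to the
firmware:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Illustrative sketch: refuse hardware GSO for packets whose gso_size
 * exceeds what the hardware tolerates (~9700 bytes per the decoded dump),
 * so validate_xmit_skb() resegments them in software instead.
 */
static netdev_features_t
example_features_check(struct sk_buff *skb, struct net_device *dev,
                       netdev_features_t features)
{
        if (skb_is_gso(skb) && skb_shinfo(skb)->gso_size > 9700)
                features &= ~NETIF_F_GSO_MASK;

        return features;
}

That avoids the firmware assert but, as discussed below, does not solve the
MSS signalling problem with the AIX LPAR.
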
> >
> > However, this still leaves us with some issues. The only thing the
> > bnx2x driver can sensibly do is drop the packet, which will prevent
> > the crash. However, there will still be issues with large packets:
> > when they are dropped, the other side will eventually realise that the
> > data is missing and ask for a retransmit, but the retransmit might
> > also be too big - there's no way of signalling back to the AIX LPAR
> > that it should reduce the MSS. Even if the data eventually gets
> > through there will be a latency/throughput/performance hit.
> >
> > ==========
> >
> > Seeing as IBM seems to be actively developing in this area (indeed,
> > this code explicitly deals with ibmveth + OVS), could we gently push
> > for this issue to be considered and fixed as well?
> >
> > I'm happy to engage on Launchpad or whatever the most appropriate
> > channel is.
> >
> > Regards,
> > Daniel
> >
> > On Sat, Aug 26, 2017 at 2:23 AM, Joseph Salisbury
> > <joseph.salisbury at canonical.com> wrote:
> >> From: Sivakumar Krishnasamy <ksiva at linux.vnet.ibm.com>
> >>
> >> BugLink: http://bugs.launchpad.net/bugs/1692538
> >>
> >> The current largesend and checksum offload feature in the ibmveth driver
> >> works as follows:
> >> - Source VM sends the TCP packets with ip_summed field set as
> >>   CHECKSUM_PARTIAL and TCP pseudo header checksum is placed in
> >>   checksum field
> >> - CHECKSUM_PARTIAL flag in SKB will enable ibmveth driver to mark
> >>   "no checksum" and "checksum good" bits in transmit buffer descriptor
> >>   before the packet is delivered to pseries PowerVM Hypervisor
> >> - If ibmveth has largesend capability enabled, transmit buffer descriptors
> >>   are marked accordingly before the packet is delivered to the Hypervisor
> >>   (along with the mss value for packets with length > MSS)
> >> - Destination VM's ibmveth driver receives the packet with "checksum good"
> >>   bit set and so, SKB's ip_summed field is set with CHECKSUM_UNNECESSARY
> >> - If the "largesend" bit was on, the mss value is copied from the receive
> >>   descriptor into SKB's gso_size and other flags are appropriately set for
> >>   packets > MSS size
> >> - The packet is now successfully delivered up the stack in the destination VM
> >>
> >> The offloads described above work fine for TCP communication among VMs in
> >> the same pseries server (VM A <=> PowerVM Hypervisor <=> VM B).
> >>
> >> We are now enabling support for OVS in the pseries PowerVM environment. One
> >> of our requirements is to have the ibmveth driver configured in "Trunk" mode
> >> when it is used with OVS. This is because the PowerVM Hypervisor will no
> >> longer bridge the packets between VMs; instead the packets are delivered to
> >> the IO Server, which hosts OVS to bridge them between VMs or to external
> >> networks (flow shown below):
> >>
> >> VM A <=> PowerVM Hypervisor <=> IO Server(OVS) <=> PowerVM Hypervisor <=> VM B
> >>
> >> In the "IO server" the packet is received by the inbound Trunk ibmveth and
> >> then delivered to OVS, which then bridges it to the outbound Trunk ibmveth
> >> (shown below):
> >>
> >> Inbound Trunk ibmveth <=> OVS <=> Outbound Trunk ibmveth
> >>
> >> In this model, we hit the following issues which impacted the VM
> >> communication performance:
> >>
> >> - Issue 1: ibmveth doesn't support largesend and checksum offload features
> >>   when configured as "Trunk". The driver has explicit checks to prevent
> >>   enabling these offloads.
> >>
> >> - Issue 2: SYN packet drops are seen at the destination VM. When the packet
> >>   originates, it has the CHECKSUM_PARTIAL flag set, and as it gets delivered
> >>   to the IO server's inbound Trunk ibmveth, on validating the "checksum good"
> >>   bits in the ibmveth receive routine, the SKB's ip_summed field is set with
> >>   the CHECKSUM_UNNECESSARY flag. This packet is then bridged by OVS (or Linux
> >>   Bridge) and delivered to the outbound Trunk ibmveth. At this point the
> >>   outbound ibmveth transmit routine will not set the "no checksum" and
> >>   "checksum good" bits in the transmit buffer descriptor, as it does so only
> >>   when the ip_summed field is CHECKSUM_PARTIAL. When this packet gets
> >>   delivered to the destination VM, the TCP layer receives the packet with a
> >>   checksum value of 0 and with no checksum-related flags in the ip_summed
> >>   field. This leads to packet drops, so TCP connections never go through.
> >>
> >> - Issue 3: The first packet of a TCP connection will be dropped if there is
> >>   no OVS flow cached in the datapath. While trying to identify the flow, OVS
> >>   computes the checksum. The computed checksum will be invalid at the
> >>   receiving end, as the ibmveth transmit routine zeroes out the pseudo
> >>   checksum value in the packet. This leads to a packet drop.
> >>
> >> - Issue 4: The ibmveth driver doesn't have support for SKBs with a frag_list.
> >>   When the physical NIC has GRO enabled and OVS bridges these packets, the
> >>   OVS vport send code will end up calling dev_queue_xmit, which in turn
> >>   calls validate_xmit_skb.
> >>   In the validate_xmit_skb routine, the larger packets will get segmented into
> >>   MSS-sized segments if the SKB has a frag_list and the driver to which
> >>   they are delivered doesn't support the NETIF_F_FRAGLIST feature.
> >>
> >> This patch addresses the above four issues, thereby enabling end-to-end
> >> largesend and checksum offload support for better performance.
> >>
> >> - Fix for Issue 1: Remove the checks which prevent enabling TCP largesend and
> >>   checksum offloads.
> >> - Fix for Issue 2: When ibmveth receives a packet with the "checksum good"
> >>   bit set and it is configured in Trunk mode, set the appropriate SKB fields
> >>   using skb_partial_csum_set (the ip_summed field is set with
> >>   CHECKSUM_PARTIAL).
> >> - Fix for Issue 3: Recompute the pseudo header checksum before sending the
> >>   SKB up the stack.
> >> - Fix for Issue 4: Linearize SKBs with a frag_list. Though we end up
> >>   allocating buffers and copying data, this fix gives up to a 4X throughput
> >>   increase.
> >>
> >> Note: All these fixes need to be applied together, as fixing just one of
> >> them will lead to other issues immediately (especially for Issues 1, 2 & 3).
> >>
> >> Signed-off-by: Sivakumar Krishnasamy <ksiva at linux.vnet.ibm.com>
> >> Signed-off-by: David S. Miller <davem at davemloft.net>
> >> (backported from commit 66aa0678efc29abd2ab02a09b23f9a8bc9f12a6c)
> >> Signed-off-by: Joseph Salisbury <joseph.salisbury at canonical.com>
> >> ---
> >> drivers/net/ethernet/ibm/ibmveth.c | 107 ++++++++++++++++++++++++++++++-------
> >> drivers/net/ethernet/ibm/ibmveth.h | 1 +
> >> 2 files changed, 90 insertions(+), 18 deletions(-)
> >>
> >> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
> >> index 2f9b12c..0482491 100644
> >> --- a/drivers/net/ethernet/ibm/ibmveth.c
> >> +++ b/drivers/net/ethernet/ibm/ibmveth.c
> >> @@ -46,6 +46,8 @@
> >> #include <asm/vio.h>
> >> #include <asm/iommu.h>
> >> #include <asm/firmware.h>
> >> +#include <net/tcp.h>
> >> +#include <net/ip6_checksum.h>
> >>
> >> #include "ibmveth.h"
> >>
> >> @@ -804,8 +806,7 @@ static int ibmveth_set_csum_offload(struct net_device *dev, u32 data)
> >>
> >> ret = h_illan_attributes(adapter->vdev->unit_address, 0, 0, &ret_attr);
> >>
> >> - if (ret == H_SUCCESS && !(ret_attr & IBMVETH_ILLAN_ACTIVE_TRUNK) &&
> >> - !(ret_attr & IBMVETH_ILLAN_TRUNK_PRI_MASK) &&
> >> + if (ret == H_SUCCESS &&
> >> (ret_attr & IBMVETH_ILLAN_PADDED_PKT_CSUM)) {
> >> ret4 = h_illan_attributes(adapter->vdev->unit_address, clr_attr,
> >> set_attr, &ret_attr);
> >> @@ -1035,6 +1036,15 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
> >> dma_addr_t dma_addr;
> >> unsigned long mss = 0;
> >>
> >> + /* veth doesn't handle frag_list, so linearize the skb.
> >> + * When GRO is enabled SKB's can have frag_list.
> >> + */
> >> + if (adapter->is_active_trunk &&
> >> + skb_has_frag_list(skb) && __skb_linearize(skb)) {
> >> + netdev->stats.tx_dropped++;
> >> + goto out;
> >> + }
> >> +
> >> /*
> >> * veth handles a maximum of 6 segments including the header, so
> >> * we have to linearize the skb if there are more than this.
> >> @@ -1059,9 +1069,6 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
> >>
> >> desc_flags = IBMVETH_BUF_VALID;
> >>
> >> - if (skb_is_gso(skb) && adapter->fw_large_send_support)
> >> - desc_flags |= IBMVETH_BUF_LRG_SND;
> >> -
> >> if (skb->ip_summed == CHECKSUM_PARTIAL) {
> >> unsigned char *buf = skb_transport_header(skb) +
> >> skb->csum_offset;
> >> @@ -1071,6 +1078,9 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
> >> /* Need to zero out the checksum */
> >> buf[0] = 0;
> >> buf[1] = 0;
> >> +
> >> + if (skb_is_gso(skb) && adapter->fw_large_send_support)
> >> + desc_flags |= IBMVETH_BUF_LRG_SND;
> >> }
> >>
> >> retry_bounce:
> >> @@ -1123,7 +1133,7 @@ retry_bounce:
> >> descs[i+1].fields.address = dma_addr;
> >> }
> >>
> >> - if (skb_is_gso(skb)) {
> >> + if (skb->ip_summed == CHECKSUM_PARTIAL && skb_is_gso(skb)) {
> >> if (adapter->fw_large_send_support) {
> >> mss = (unsigned long)skb_shinfo(skb)->gso_size;
> >> adapter->tx_large_packets++;
> >> @@ -1224,6 +1234,71 @@ static void ibmveth_rx_mss_helper(struct sk_buff *skb, u16 mss, int lrg_pkt)
> >> }
> >> }
> >>
> >> +static void ibmveth_rx_csum_helper(struct sk_buff *skb,
> >> + struct ibmveth_adapter *adapter)
> >> +{
> >> + struct iphdr *iph = NULL;
> >> + struct ipv6hdr *iph6 = NULL;
> >> + __be16 skb_proto = 0;
> >> + u16 iphlen = 0;
> >> + u16 iph_proto = 0;
> >> + u16 tcphdrlen = 0;
> >> +
> >> + skb_proto = be16_to_cpu(skb->protocol);
> >> +
> >> + if (skb_proto == ETH_P_IP) {
> >> + iph = (struct iphdr *)skb->data;
> >> +
> >> + /* If the IP checksum is not offloaded and if the packet
> >> + * is large send, the checksum must be rebuilt.
> >> + */
> >> + if (iph->check == 0xffff) {
> >> + iph->check = 0;
> >> + iph->check = ip_fast_csum((unsigned char *)iph,
> >> + iph->ihl);
> >> + }
> >> +
> >> + iphlen = iph->ihl * 4;
> >> + iph_proto = iph->protocol;
> >> + } else if (skb_proto == ETH_P_IPV6) {
> >> + iph6 = (struct ipv6hdr *)skb->data;
> >> + iphlen = sizeof(struct ipv6hdr);
> >> + iph_proto = iph6->nexthdr;
> >> + }
> >> +
> >> + /* In OVS environment, when a flow is not cached, specifically for a
> >> + * new TCP connection, the first packet information is passed up
> >> + * the user space for finding a flow. During this process, OVS computes
> >> + * checksum on the first packet when CHECKSUM_PARTIAL flag is set.
> >> + *
> >> + * Given that we zeroed out TCP checksum field in transmit path
> >> + * (refer ibmveth_start_xmit routine) as we set "no checksum bit",
> >> + * OVS computed checksum will be incorrect w/o TCP pseudo checksum
> >> + * in the packet. This leads to OVS dropping the packet and hence
> >> + * TCP retransmissions are seen.
> >> + *
> >> + * So, re-compute TCP pseudo header checksum.
> >> + */
> >> + if (iph_proto == IPPROTO_TCP && adapter->is_active_trunk) {
> >> + struct tcphdr *tcph = (struct tcphdr *)(skb->data + iphlen);
> >> +
> >> + tcphdrlen = skb->len - iphlen;
> >> +
> >> + /* Recompute TCP pseudo header checksum */
> >> + if (skb_proto == ETH_P_IP)
> >> + tcph->check = ~csum_tcpudp_magic(iph->saddr,
> >> + iph->daddr, tcphdrlen, iph_proto, 0);
> >> + else if (skb_proto == ETH_P_IPV6)
> >> + tcph->check = ~csum_ipv6_magic(&iph6->saddr,
> >> + &iph6->daddr, tcphdrlen, iph_proto, 0);
> >> +
> >> + /* Setup SKB fields for checksum offload */
> >> + skb_partial_csum_set(skb, iphlen,
> >> + offsetof(struct tcphdr, check));
> >> + skb_reset_network_header(skb);
> >> + }
> >> +}
> >> +
> >> static int ibmveth_poll(struct napi_struct *napi, int budget)
> >> {
> >> struct ibmveth_adapter *adapter =
> >> @@ -1231,7 +1306,6 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
> >> struct net_device *netdev = adapter->netdev;
> >> int frames_processed = 0;
> >> unsigned long lpar_rc;
> >> - struct iphdr *iph;
> >> u16 mss = 0;
> >>
> >> restart_poll:
> >> @@ -1289,17 +1363,7 @@ restart_poll:
> >>
> >> if (csum_good) {
> >> skb->ip_summed = CHECKSUM_UNNECESSARY;
> >> - if (be16_to_cpu(skb->protocol) == ETH_P_IP) {
> >> - iph = (struct iphdr *)skb->data;
> >> -
> >> - /* If the IP checksum is not offloaded and if the packet
> >> - * is large send, the checksum must be rebuilt.
> >> - */
> >> - if (iph->check == 0xffff) {
> >> - iph->check = 0;
> >> - iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
> >> - }
> >> - }
> >> + ibmveth_rx_csum_helper(skb, adapter);
> >> }
> >>
> >> if (length > netdev->mtu + ETH_HLEN) {
> >> @@ -1621,6 +1685,13 @@ static int ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
> >> netdev->hw_features |= NETIF_F_TSO;
> >> }
> >>
> >> + adapter->is_active_trunk = false;
> >> + if (ret == H_SUCCESS && (ret_attr & IBMVETH_ILLAN_ACTIVE_TRUNK)) {
> >> + adapter->is_active_trunk = true;
> >> + netdev->hw_features |= NETIF_F_FRAGLIST;
> >> + netdev->features |= NETIF_F_FRAGLIST;
> >> + }
> >> +
> >> memcpy(netdev->dev_addr, mac_addr_p, ETH_ALEN);
> >>
> >> if (firmware_has_feature(FW_FEATURE_CMO))
> >> diff --git a/drivers/net/ethernet/ibm/ibmveth.h b/drivers/net/ethernet/ibm/ibmveth.h
> >> index 7acda04..de6e381 100644
> >> --- a/drivers/net/ethernet/ibm/ibmveth.h
> >> +++ b/drivers/net/ethernet/ibm/ibmveth.h
> >> @@ -157,6 +157,7 @@ struct ibmveth_adapter {
> >> int pool_config;
> >> int rx_csum;
> >> int large_send;
> >> + bool is_active_trunk;
> >> void *bounce_buffer;
> >> dma_addr_t bounce_buffer_dma;
> >>
> >> --
> >> 2.7.4
> >>
> >>
> >> --
> >> kernel-team mailing list
> >> kernel-team at lists.ubuntu.com
> >> https://lists.ubuntu.com/mailman/listinfo/kernel-team
> Hi Daniel,
>
> The SRU request for Xenial is on hold. I forwarded the information you
> provided to IBM. Thanks for the detailed write-up! I'll keep you up to
> date on any feedback they provide.
>
> You could also subscribe to the bug report, and feel free to comment as
> needed:
> http://pad.lv/1692538
>
> Thanks,
>
> Joe
>