ACKED: [SRU][Zesty][Artful][PATCH 1/1] ibmveth: Support to enable LSO/CSO for Trunk VEA.
Kleber Souza
kleber.souza at canonical.com
Mon Jul 31 08:38:40 UTC 2017
On 07/30/17 16:25, Joseph Salisbury wrote:
> From: Sivakumar Krishnasamy <ksiva at linux.vnet.ibm.com>
>
> BugLink: http://bugs.launchpad.net/bugs/1692538
>
> The current largesend and checksum offload feature in the ibmveth
> driver works as follows:
> - The source VM sends TCP packets with the ip_summed field set to
> CHECKSUM_PARTIAL and the TCP pseudo header checksum placed in the
> checksum field
> - The CHECKSUM_PARTIAL flag in the SKB makes the ibmveth driver set the
> "no checksum" and "checksum good" bits in the transmit buffer
> descriptor before the packet is delivered to the pseries PowerVM
> Hypervisor
> - If ibmveth has the largesend capability enabled, transmit buffer
> descriptors are marked accordingly before the packet is delivered to
> the Hypervisor (along with the MSS value for packets longer than the
> MSS)
> - The destination VM's ibmveth driver receives the packet with the
> "checksum good" bit set, so the SKB's ip_summed field is set to
> CHECKSUM_UNNECESSARY
> - If the "largesend" bit was on, the MSS value is copied from the
> receive descriptor into the SKB's gso_size and other flags are set
> appropriately for packets larger than the MSS
> - The packet is now successfully delivered up the stack in the
> destination VM
>
> The offloads described above work fine for TCP communication among VMs
> on the same pseries server ( VM A <=> PowerVM Hypervisor <=> VM B ); a
> stand-in sketch of the transmit-side flag handling is shown below.
>
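To make the flag handling above concrete, here is a minimal, self-contained
sketch (plain userspace C, not the driver code) of the transmit-side
decisions described in the list above: the CHECKSUM_PARTIAL/GSO state of a
frame determines which descriptor bits are requested. The struct and the
BUF_* constants are illustrative stand-ins for the driver's skb fields and
IBMVETH_BUF_* flags.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BUF_NO_CSUM   0x1u   /* stand-in for IBMVETH_BUF_NO_CSUM */
#define BUF_CSUM_GOOD 0x2u   /* stand-in for IBMVETH_BUF_CSUM_GOOD */
#define BUF_LRG_SND   0x4u   /* stand-in for IBMVETH_BUF_LRG_SND */

struct fake_skb {
	bool csum_partial;   /* models ip_summed == CHECKSUM_PARTIAL */
	bool is_gso;         /* models skb_is_gso() */
	uint16_t gso_size;   /* MSS for GSO frames */
};

/* Descriptor bits requested before the frame is handed to the Hypervisor,
 * per the description above: checksum bits only for CHECKSUM_PARTIAL
 * frames, and largesend only for GSO frames when firmware supports it. */
static uint32_t tx_desc_flags(const struct fake_skb *skb, bool fw_large_send)
{
	uint32_t flags = 0;

	if (skb->csum_partial) {
		flags |= BUF_NO_CSUM | BUF_CSUM_GOOD;
		if (skb->is_gso && fw_large_send)
			flags |= BUF_LRG_SND;
	}
	return flags;
}

int main(void)
{
	struct fake_skb skb = {
		.csum_partial = true, .is_gso = true, .gso_size = 1448,
	};

	printf("flags=0x%x mss=%u\n",
	       (unsigned)tx_desc_flags(&skb, true), (unsigned)skb.gso_size);
	return 0;
}
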
> We are now enabling support for OVS in the pseries PowerVM environment.
> One of our requirements is to have the ibmveth driver configured in
> "Trunk" mode when it is used with OVS. This is because the PowerVM
> Hypervisor will no longer bridge the packets between VMs; instead the
> packets are delivered to the IO Server, which hosts OVS to bridge them
> between VMs or to external networks (flow shown below):
> VM A <=> PowerVM Hypervisor <=> IO Server(OVS) <=> PowerVM Hypervisor
> <=> VM B
> On the IO Server the packet is received by the inbound Trunk ibmveth
> and then delivered to OVS, which bridges it to the outbound Trunk
> ibmveth (shown below):
> Inbound Trunk ibmveth <=> OVS <=> Outbound Trunk ibmveth
>
> In this model, we hit the following issues, which impacted VM
> communication performance:
>
> - Issue 1: ibmveth doesn't support the largesend and checksum offload
> features when configured as "Trunk". The driver has explicit checks
> that prevent enabling these offloads.
>
> - Issue 2: SYN packet drops seen at the destination VM. When the packet
> originates, it has the CHECKSUM_PARTIAL flag set. As it gets delivered
> to the IO Server's inbound Trunk ibmveth, the ibmveth receive routine
> validates the "checksum good" bits and sets the SKB's ip_summed field
> to CHECKSUM_UNNECESSARY. This packet is then bridged by OVS (or Linux
> Bridge) and delivered to the outbound Trunk ibmveth. At this point the
> outbound ibmveth transmit routine will not set the "no checksum" and
> "checksum good" bits in the transmit buffer descriptor, as it does so
> only when the ip_summed field is CHECKSUM_PARTIAL. When this packet
> gets delivered to the destination VM, the TCP layer receives it with a
> checksum value of 0 and no checksum-related flags in the ip_summed
> field. This leads to packet drops, so TCP connections are never
> established. (A stand-in sketch of the receive-side marking that
> addresses this is shown below.)
>
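For reference, the receive-side marking that the fix for Issue 2 introduces
(see the fix list further down) can be modelled with a small stand-alone
sketch. This is plain C with illustrative stand-in names, not the kernel's
definitions: a trunk-mode TCP frame that arrived with "checksum good" keeps
CHECKSUM_PARTIAL so the offload state survives the OVS bridge hop, while an
endpoint VM still sees CHECKSUM_UNNECESSARY.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-in values; the kernel defines its own constants. */
enum ip_summed { CSUM_NONE, CSUM_UNNECESSARY, CSUM_PARTIAL };

/* Marking applied in the receive path, modelling the behaviour after the
 * fix: only trunk-mode TCP traffic is re-marked as CHECKSUM_PARTIAL. */
static enum ip_summed rx_mark(bool csum_good, bool is_active_trunk,
			      bool is_tcp)
{
	if (!csum_good)
		return CSUM_NONE;          /* let the stack verify it */
	if (is_active_trunk && is_tcp)
		return CSUM_PARTIAL;       /* offload survives bridging */
	return CSUM_UNNECESSARY;           /* endpoint VM: nothing to do */
}

int main(void)
{
	printf("trunk rx mark: %d (expect %d)\n",
	       rx_mark(true, true, true), CSUM_PARTIAL);
	return 0;
}
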
> - Issue 3: The first packet of a TCP connection will be dropped if
> there is no OVS flow cached in the datapath. While trying to identify
> the flow, OVS computes the checksum. The computed checksum will be
> invalid at the receiving end because the ibmveth transmit routine
> zeroes out the pseudo header checksum value in the packet. This leads
> to a packet drop. (A worked example of the pseudo header checksum is
> shown below.)
>
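As a reminder of what gets recomputed here, the IPv4/TCP pseudo header
covers the source address, destination address, a zero byte, the protocol
number and the TCP length. A CHECKSUM_PARTIAL skb carries the folded (not
complemented) one's-complement sum of those fields in tcph->check, which is
what the fix restores via ~csum_tcpudp_magic() in the diff below. A small
runnable illustration in plain userspace C, with addresses kept in host
byte order for simplicity:

#include <netinet/in.h>   /* IPPROTO_TCP */
#include <stdint.h>
#include <stdio.h>

/* Folded, non-complemented one's-complement sum of the IPv4/TCP pseudo
 * header: saddr, daddr, zero byte, protocol, TCP length (header+payload). */
static uint16_t tcp_v4_pseudo_sum(uint32_t saddr, uint32_t daddr,
				  uint16_t tcp_len)
{
	uint32_t sum = 0;

	sum += (saddr >> 16) + (saddr & 0xffff);
	sum += (daddr >> 16) + (daddr & 0xffff);
	sum += IPPROTO_TCP;
	sum += tcp_len;

	while (sum >> 16)                  /* fold carries back in */
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)sum;
}

int main(void)
{
	uint32_t saddr = (10u << 24) | 1u;   /* 10.0.0.1 */
	uint32_t daddr = (10u << 24) | 2u;   /* 10.0.0.2 */

	printf("pseudo header sum: 0x%04x\n",
	       (unsigned)tcp_v4_pseudo_sum(saddr, daddr, 1500));
	return 0;
}
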
> - Issue 4: The ibmveth driver doesn't support SKBs with a frag_list.
> When the physical NIC has GRO enabled and OVS bridges these packets,
> the OVS vport send code ends up calling dev_queue_xmit, which in turn
> calls validate_xmit_skb. In validate_xmit_skb, larger packets get
> segmented into MSS-sized segments if the SKB has a frag_list and the
> driver they are delivered to doesn't support the NETIF_F_FRAGLIST
> feature. (A stand-in sketch of the linearize-or-drop handling is shown
> below.)
>
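The handling added for this case is small; as a rough stand-alone model
(plain C with a stand-in skb type, not the kernel code), the driver only
accepts frag_list skbs in trunk mode, flattens them into one buffer itself,
and counts a drop only when that copy fails:

#include <stdbool.h>
#include <stdio.h>

struct fake_skb {
	bool has_frag_list;   /* models skb_has_frag_list() */
};

/* Pretend linearization; returns false on (simulated) allocation failure. */
static bool fake_linearize(struct fake_skb *skb, bool alloc_ok)
{
	if (!alloc_ok)
		return false;
	skb->has_frag_list = false;   /* data now lives in a single buffer */
	return true;
}

/* Mirrors the start_xmit decision described above: linearize trunk-mode
 * frag_list skbs, and drop the frame only when that is impossible. */
static bool xmit_prepare(struct fake_skb *skb, bool is_active_trunk,
			 bool alloc_ok, unsigned long *tx_dropped)
{
	if (is_active_trunk && skb->has_frag_list &&
	    !fake_linearize(skb, alloc_ok)) {
		(*tx_dropped)++;
		return false;         /* frame dropped */
	}
	return true;                  /* continue with transmission */
}

int main(void)
{
	struct fake_skb skb = { .has_frag_list = true };
	unsigned long dropped = 0;

	printf("sent=%d dropped=%lu\n",
	       xmit_prepare(&skb, true, true, &dropped), dropped);
	return 0;
}
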
> This patch addresses the above four issues, thereby enabling end-to-end
> largesend and checksum offload support for better performance.
>
> - Fix for Issue 1: Remove the checks that prevent enabling TCP
> largesend and checksum offloads.
> - Fix for Issue 2: When ibmveth receives a packet with the "checksum
> good" bit set and it is configured in Trunk mode, set the appropriate
> SKB fields using skb_partial_csum_set (the ip_summed field is set to
> CHECKSUM_PARTIAL).
> - Fix for Issue 3: Recompute the pseudo header checksum before sending
> the SKB up the stack.
> - Fix for Issue 4: Linearize SKBs with a frag_list. Though we end up
> allocating buffers and copying data, this fix gives up to a 4X
> throughput increase.
>
> Note: All these fixes need to go in together, as fixing just one of
> them will immediately lead to other issues (especially Issues 1, 2
> and 3).
>
> Signed-off-by: Sivakumar Krishnasamy <ksiva at linux.vnet.ibm.com>
> Signed-off-by: David S. Miller <davem at davemloft.net>
> (cherry picked from commit 66aa0678efc29abd2ab02a09b23f9a8bc9f12a6c)
> Signed-off-by: Joseph Salisbury <joseph.salisbury at canonical.com>
Clean cherry-pick, good test results and limited to a single driver.
Acked-by: Kleber Sacilotto de Souza <kleber.souza at canonical.com>
> ---
> drivers/net/ethernet/ibm/ibmveth.c | 107 ++++++++++++++++++++++++++++++-------
> drivers/net/ethernet/ibm/ibmveth.h | 1 +
> 2 files changed, 90 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
> index 72ab7b6..9a74c4e 100644
> --- a/drivers/net/ethernet/ibm/ibmveth.c
> +++ b/drivers/net/ethernet/ibm/ibmveth.c
> @@ -46,6 +46,8 @@
> #include <asm/vio.h>
> #include <asm/iommu.h>
> #include <asm/firmware.h>
> +#include <net/tcp.h>
> +#include <net/ip6_checksum.h>
>
> #include "ibmveth.h"
>
> @@ -808,8 +810,7 @@ static int ibmveth_set_csum_offload(struct net_device *dev, u32 data)
>
> ret = h_illan_attributes(adapter->vdev->unit_address, 0, 0, &ret_attr);
>
> - if (ret == H_SUCCESS && !(ret_attr & IBMVETH_ILLAN_ACTIVE_TRUNK) &&
> - !(ret_attr & IBMVETH_ILLAN_TRUNK_PRI_MASK) &&
> + if (ret == H_SUCCESS &&
> (ret_attr & IBMVETH_ILLAN_PADDED_PKT_CSUM)) {
> ret4 = h_illan_attributes(adapter->vdev->unit_address, clr_attr,
> set_attr, &ret_attr);
> @@ -1040,6 +1041,15 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
> dma_addr_t dma_addr;
> unsigned long mss = 0;
>
> + /* veth doesn't handle frag_list, so linearize the skb.
> + * When GRO is enabled SKB's can have frag_list.
> + */
> + if (adapter->is_active_trunk &&
> + skb_has_frag_list(skb) && __skb_linearize(skb)) {
> + netdev->stats.tx_dropped++;
> + goto out;
> + }
> +
> /*
> * veth handles a maximum of 6 segments including the header, so
> * we have to linearize the skb if there are more than this.
> @@ -1064,9 +1074,6 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
>
> desc_flags = IBMVETH_BUF_VALID;
>
> - if (skb_is_gso(skb) && adapter->fw_large_send_support)
> - desc_flags |= IBMVETH_BUF_LRG_SND;
> -
> if (skb->ip_summed == CHECKSUM_PARTIAL) {
> unsigned char *buf = skb_transport_header(skb) +
> skb->csum_offset;
> @@ -1076,6 +1083,9 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
> /* Need to zero out the checksum */
> buf[0] = 0;
> buf[1] = 0;
> +
> + if (skb_is_gso(skb) && adapter->fw_large_send_support)
> + desc_flags |= IBMVETH_BUF_LRG_SND;
> }
>
> retry_bounce:
> @@ -1128,7 +1138,7 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
> descs[i+1].fields.address = dma_addr;
> }
>
> - if (skb_is_gso(skb)) {
> + if (skb->ip_summed == CHECKSUM_PARTIAL && skb_is_gso(skb)) {
> if (adapter->fw_large_send_support) {
> mss = (unsigned long)skb_shinfo(skb)->gso_size;
> adapter->tx_large_packets++;
> @@ -1232,6 +1242,71 @@ static void ibmveth_rx_mss_helper(struct sk_buff *skb, u16 mss, int lrg_pkt)
> }
> }
>
> +static void ibmveth_rx_csum_helper(struct sk_buff *skb,
> + struct ibmveth_adapter *adapter)
> +{
> + struct iphdr *iph = NULL;
> + struct ipv6hdr *iph6 = NULL;
> + __be16 skb_proto = 0;
> + u16 iphlen = 0;
> + u16 iph_proto = 0;
> + u16 tcphdrlen = 0;
> +
> + skb_proto = be16_to_cpu(skb->protocol);
> +
> + if (skb_proto == ETH_P_IP) {
> + iph = (struct iphdr *)skb->data;
> +
> + /* If the IP checksum is not offloaded and if the packet
> + * is large send, the checksum must be rebuilt.
> + */
> + if (iph->check == 0xffff) {
> + iph->check = 0;
> + iph->check = ip_fast_csum((unsigned char *)iph,
> + iph->ihl);
> + }
> +
> + iphlen = iph->ihl * 4;
> + iph_proto = iph->protocol;
> + } else if (skb_proto == ETH_P_IPV6) {
> + iph6 = (struct ipv6hdr *)skb->data;
> + iphlen = sizeof(struct ipv6hdr);
> + iph_proto = iph6->nexthdr;
> + }
> +
> + /* In OVS environment, when a flow is not cached, specifically for a
> + * new TCP connection, the first packet information is passed up
> + * the user space for finding a flow. During this process, OVS computes
> + * checksum on the first packet when CHECKSUM_PARTIAL flag is set.
> + *
> + * Given that we zeroed out TCP checksum field in transmit path
> + * (refer ibmveth_start_xmit routine) as we set "no checksum bit",
> + * OVS computed checksum will be incorrect w/o TCP pseudo checksum
> + * in the packet. This leads to OVS dropping the packet and hence
> + * TCP retransmissions are seen.
> + *
> + * So, re-compute TCP pseudo header checksum.
> + */
> + if (iph_proto == IPPROTO_TCP && adapter->is_active_trunk) {
> + struct tcphdr *tcph = (struct tcphdr *)(skb->data + iphlen);
> +
> + tcphdrlen = skb->len - iphlen;
> +
> + /* Recompute TCP pseudo header checksum */
> + if (skb_proto == ETH_P_IP)
> + tcph->check = ~csum_tcpudp_magic(iph->saddr,
> + iph->daddr, tcphdrlen, iph_proto, 0);
> + else if (skb_proto == ETH_P_IPV6)
> + tcph->check = ~csum_ipv6_magic(&iph6->saddr,
> + &iph6->daddr, tcphdrlen, iph_proto, 0);
> +
> + /* Setup SKB fields for checksum offload */
> + skb_partial_csum_set(skb, iphlen,
> + offsetof(struct tcphdr, check));
> + skb_reset_network_header(skb);
> + }
> +}
> +
> static int ibmveth_poll(struct napi_struct *napi, int budget)
> {
> struct ibmveth_adapter *adapter =
> @@ -1239,7 +1314,6 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
> struct net_device *netdev = adapter->netdev;
> int frames_processed = 0;
> unsigned long lpar_rc;
> - struct iphdr *iph;
> u16 mss = 0;
>
> restart_poll:
> @@ -1297,17 +1371,7 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
>
> if (csum_good) {
> skb->ip_summed = CHECKSUM_UNNECESSARY;
> - if (be16_to_cpu(skb->protocol) == ETH_P_IP) {
> - iph = (struct iphdr *)skb->data;
> -
> - /* If the IP checksum is not offloaded and if the packet
> - * is large send, the checksum must be rebuilt.
> - */
> - if (iph->check == 0xffff) {
> - iph->check = 0;
> - iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
> - }
> - }
> + ibmveth_rx_csum_helper(skb, adapter);
> }
>
> if (length > netdev->mtu + ETH_HLEN) {
> @@ -1626,6 +1690,13 @@ static int ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
> netdev->hw_features |= NETIF_F_TSO;
> }
>
> + adapter->is_active_trunk = false;
> + if (ret == H_SUCCESS && (ret_attr & IBMVETH_ILLAN_ACTIVE_TRUNK)) {
> + adapter->is_active_trunk = true;
> + netdev->hw_features |= NETIF_F_FRAGLIST;
> + netdev->features |= NETIF_F_FRAGLIST;
> + }
> +
> netdev->min_mtu = IBMVETH_MIN_MTU;
> netdev->max_mtu = ETH_MAX_MTU;
>
> diff --git a/drivers/net/ethernet/ibm/ibmveth.h b/drivers/net/ethernet/ibm/ibmveth.h
> index 7acda04..de6e381 100644
> --- a/drivers/net/ethernet/ibm/ibmveth.h
> +++ b/drivers/net/ethernet/ibm/ibmveth.h
> @@ -157,6 +157,7 @@ struct ibmveth_adapter {
> int pool_config;
> int rx_csum;
> int large_send;
> + bool is_active_trunk;
> void *bounce_buffer;
> dma_addr_t bounce_buffer_dma;
>
>