[RFC][Focal][PATCH 2/2] RDMA/core: Introduce peer memory interface

Stefan Bader stefan.bader at canonical.com
Tue Aug 31 12:34:37 UTC 2021


On 31.08.21 00:50, dann frazier wrote:
> From: Feras Daoud <ferasda at mellanox.com>
> 
> BugLink: https://bugs.launchpad.net/bugs/1923104
> 
> The peer_memory_client scheme allows a driver to register with the ib_umem
> system that it has the ability to understand user virtual address ranges
> that are not compatible with get_user_pages(). For instance, VMAs created
> with io_remap_pfn_range(), or other driver-special VMAs.
> 
> For ranges the interface understands it can provide a DMA mapped sg_table
> for use by the ib_umem, allowing user virtual ranges that cannot be
> supported by get_user_pages() to be used as umems for RDMA.
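> 
> For illustration only (not part of this patch): a peer client built on the
> new include/rdma/peer_mem.h would hook in roughly as below. The my_* names
> are hypothetical placeholders for a real driver's callback implementations:
> 
>   #include <rdma/peer_mem.h>
> 
>   static invalidate_peer_memory mem_invalidate_cb;
>   static void *reg_handle;
> 
>   static const struct peer_memory_client my_peer_client = {
>           .name          = "my_peer_mem",
>           .version       = "1.0",
>           .acquire       = my_acquire,    /* does this VA range belong to us? */
>           .get_pages     = my_get_pages,  /* pin the peer pages */
>           .dma_map       = my_dma_map,    /* build the DMA mapped sg_table */
>           .dma_unmap     = my_dma_unmap,
>           .put_pages     = my_put_pages,
>           .get_page_size = my_get_page_size,
>           .release       = my_release,
>   };
> 
>   static int my_peer_register(void)
>   {
>           /*
>            * Passing &mem_invalidate_cb marks this client as requiring
>            * invalidation support from every umem built on it.
>            */
>           reg_handle = ib_register_peer_memory_client(&my_peer_client,
>                                                       &mem_invalidate_cb);
>           return reg_handle ? 0 : -EINVAL;
>   }
> 
>   static void my_peer_unregister(void)
>   {
>           /* Waits until all umems built on this client are destroyed. */
>           ib_unregister_peer_memory_client(reg_handle);
>   }
> 
> When the underlying peer mapping goes away, the client calls
> mem_invalidate_cb(reg_handle, core_context) with the core_context it was
> handed in get_pages(), and the core unmaps and unpins that umem.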
> 
> This is designed to preserve the kABI; no functions or structures are
> changed, only new symbols are added:
> 
>   ib_register_peer_memory_client
>   ib_unregister_peer_memory_client
>   ib_umem_activate_invalidation_notifier
>   ib_umem_get_peer
> 
> And a bitfield in struct ib_umem uses more bits.
> 
> This interface is compatible with the two out of tree GPU drivers:
>   https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/blob/master/drivers/gpu/drm/amd/amdkfd/kfd_peerdirect.c
>   https://github.com/Mellanox/nv_peer_memory/blob/master/nv_peer_mem.c
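> 
> On the RDMA driver side the opt-in is per umem; the mlx5 changes below do
> essentially the following (simplified sketch, not the literal driver code;
> struct my_mr and the other my_* names are illustrative):
> 
>   static void my_invalidate(struct ib_umem *umem, void *priv)
>   {
>           struct my_mr *mr = priv;
> 
>           /*
>            * Stop HW DMA against this MR; once this returns, the core
>            * unmaps and unpins the peer pages.
>            */
>           my_kill_mr(mr);
>   }
> 
>   static int my_reg_mr(struct ib_device *ibdev, u64 start, u64 length,
>                        int access_flags, struct my_mr *mr)
>   {
>           struct ib_umem *umem;
> 
>           umem = ib_umem_get_peer(ibdev, start, length, access_flags,
>                                   IB_PEER_MEM_INVAL_SUPP);
>           if (IS_ERR(umem))
>                   return PTR_ERR(umem);
>           mr->umem = umem;
> 
>           /* ... program the HW translation from umem->sg_head ... */
> 
>           if (umem->is_peer) {
>                   /* From here on my_invalidate() may run at any time. */
>                   ib_umem_activate_invalidation_notifier(umem, my_invalidate,
>                                                          mr);
>           }
>           return 0;
>   }
> 
> Tear-down stays the usual ib_umem_release(), which hands peer umems to
> ib_peer_umem_release().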
> 
> NOTES (remove before sending):
>   - The exact locking semantics from the GPU side during invalidation
>     are confusing. I've made it sane but perhaps this will hit locking
>     problems. Test with lockdep and test invalidation.
> 
>     The main difference here is that get_pages and dma_map are called
>     from a context that will block progress of invalidation.
> 
>     The old design blocked progress of invalidation using a completion for
>     unmap and unpin, so those should be proven safe now.
> 
>     Since the old design used a completion it doesn't work with lockdep,
>     even though it has basically the same blocking semantics.
> 
>   - The API exported to the GPU side is crufty and makes very little
>     sense. Functionally it should be the same still, but many useless
>     things were dropped off
> 
>   - I rewrote all the comments; please check spelling/grammar
> 
>   - Compile tested only
> 
> Issue: 2189651
> Change-Id: I1d77f52d56aec2c79e6b9d9ec1096e83a95155cd

I am assuming this is an import from some vendor tree (and as such would be a
SAUCE patch). But this also brings up the question about upstream. If it is
upstream, but as a sequence of patches, then those would rather be what an LTS
kernel should backport. If this is still not upstream, it will remain a
liability for all upcoming releases until it is: either as a SAUCE patch that
has to be adapted while carrying it forward, or as something that potentially
gets forgotten. And even if it is a SAUCE patch in Impish already, there is
always a chance that upstream changes in ways that cause us more and more pain.

-Stefan

> Signed-off-by: Yishai Hadas <yishaih at mellanox.com>
> Signed-off-by: Feras Daoud <ferasda at mellanox.com>
> Signed-off-by: Jason Gunthorpe <jgg at mellanox.com>
> Signed-off-by: dann frazier <dann.frazier at canonical.com>
> ---
>   drivers/infiniband/core/Makefile      |   2 +-
>   drivers/infiniband/core/ib_peer_mem.h |  52 +++
>   drivers/infiniband/core/peer_mem.c    | 484 ++++++++++++++++++++++++++
>   drivers/infiniband/core/umem.c        |  44 ++-
>   drivers/infiniband/hw/mlx5/cq.c       |  11 +-
>   drivers/infiniband/hw/mlx5/devx.c     |   2 +-
>   drivers/infiniband/hw/mlx5/doorbell.c |   4 +-
>   drivers/infiniband/hw/mlx5/mem.c      |  11 +-
>   drivers/infiniband/hw/mlx5/mr.c       |  47 ++-
>   drivers/infiniband/hw/mlx5/qp.c       |   2 +-
>   drivers/infiniband/hw/mlx5/srq.c      |   2 +-
>   include/rdma/ib_umem.h                |  29 ++
>   include/rdma/peer_mem.h               | 165 +++++++++
>   13 files changed, 828 insertions(+), 27 deletions(-)
>   create mode 100644 drivers/infiniband/core/ib_peer_mem.h
>   create mode 100644 drivers/infiniband/core/peer_mem.c
>   create mode 100644 include/rdma/peer_mem.h
> 
> diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
> index 9a8871e21545..4b7838ff6e90 100644
> --- a/drivers/infiniband/core/Makefile
> +++ b/drivers/infiniband/core/Makefile
> @@ -34,5 +34,5 @@ ib_uverbs-y :=			uverbs_main.o uverbs_cmd.o uverbs_marshall.o \
>   				uverbs_std_types_flow_action.o uverbs_std_types_dm.o \
>   				uverbs_std_types_mr.o uverbs_std_types_counters.o \
>   				uverbs_uapi.o uverbs_std_types_device.o
> -ib_uverbs-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
> +ib_uverbs-$(CONFIG_INFINIBAND_USER_MEM) += umem.o peer_mem.o
>   ib_uverbs-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o
> diff --git a/drivers/infiniband/core/ib_peer_mem.h b/drivers/infiniband/core/ib_peer_mem.h
> new file mode 100644
> index 000000000000..bb38ffee724a
> --- /dev/null
> +++ b/drivers/infiniband/core/ib_peer_mem.h
> @@ -0,0 +1,52 @@
> +/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
> +/*
> + * Copyright (c) 2014-2020,  Mellanox Technologies. All rights reserved.
> + */
> +#ifndef RDMA_IB_PEER_MEM_H
> +#define RDMA_IB_PEER_MEM_H
> +
> +#include <rdma/peer_mem.h>
> +#include <linux/kobject.h>
> +#include <linux/xarray.h>
> +#include <rdma/ib_umem.h>
> +
> +struct ib_peer_memory_statistics {
> +	atomic64_t num_alloc_mrs;
> +	atomic64_t num_dealloc_mrs;
> +	atomic64_t num_reg_pages;
> +	atomic64_t num_dereg_pages;
> +	atomic64_t num_reg_bytes;
> +	atomic64_t num_dereg_bytes;
> +	unsigned long num_free_callbacks;
> +};
> +
> +struct ib_peer_memory_client {
> +	struct kobject kobj;
> +	refcount_t usecnt;
> +	struct completion usecnt_zero;
> +	const struct peer_memory_client *peer_mem;
> +	struct list_head core_peer_list;
> +	struct ib_peer_memory_statistics stats;
> +	struct xarray umem_xa;
> +	u32 xa_cyclic_next;
> +	bool invalidation_required;
> +};
> +
> +struct ib_umem_peer {
> +	struct ib_umem umem;
> +	struct kref kref;
> +	/* peer memory that manages this umem */
> +	struct ib_peer_memory_client *ib_peer_client;
> +	void *peer_client_context;
> +	umem_invalidate_func_t invalidation_func;
> +	void *invalidation_private;
> +	struct mutex mapping_lock;
> +	bool mapped;
> +	u32 xa_id;
> +};
> +
> +struct ib_umem *ib_peer_umem_get(struct ib_umem *old_umem, int old_ret,
> +				 unsigned long peer_mem_flags);
> +void ib_peer_umem_release(struct ib_umem *umem);
> +
> +#endif
> diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
> new file mode 100644
> index 000000000000..833865578cb0
> --- /dev/null
> +++ b/drivers/infiniband/core/peer_mem.c
> @@ -0,0 +1,484 @@
> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> +/*
> + * Copyright (c) 2014-2020,  Mellanox Technologies. All rights reserved.
> + */
> +#include <rdma/ib_verbs.h>
> +#include <rdma/ib_umem.h>
> +#include <linux/sched/mm.h>
> +#include "ib_peer_mem.h"
> +static DEFINE_MUTEX(peer_memory_mutex);
> +static LIST_HEAD(peer_memory_list);
> +static struct kobject *peers_kobj;
> +#define PEER_NO_INVALIDATION_ID U32_MAX
> +static int ib_invalidate_peer_memory(void *reg_handle, u64 core_context);
> +struct peer_mem_attribute {
> +	struct attribute attr;
> +	ssize_t (*show)(struct ib_peer_memory_client *ib_peer_client,
> +			struct peer_mem_attribute *attr, char *buf);
> +	ssize_t (*store)(struct ib_peer_memory_client *ib_peer_client,
> +			 struct peer_mem_attribute *attr, const char *buf,
> +			 size_t count);
> +};
> +
> +#define PEER_ATTR_RO(_name)                                                    \
> +	struct peer_mem_attribute peer_attr_ ## _name = __ATTR_RO(_name)
> +
> +static ssize_t version_show(struct ib_peer_memory_client *ib_peer_client,
> +			    struct peer_mem_attribute *attr, char *buf)
> +{
> +	return scnprintf(buf, PAGE_SIZE, "%s\n",
> +			 ib_peer_client->peer_mem->version);
> +}
> +
> +static PEER_ATTR_RO(version);
> +static ssize_t num_alloc_mrs_show(struct ib_peer_memory_client *ib_peer_client,
> +				  struct peer_mem_attribute *attr, char *buf)
> +{
> +	return scnprintf(
> +		buf, PAGE_SIZE, "%llu\n",
> +		(u64)atomic64_read(&ib_peer_client->stats.num_alloc_mrs));
> +}
> +
> +static PEER_ATTR_RO(num_alloc_mrs);
> +static ssize_t
> +num_dealloc_mrs_show(struct ib_peer_memory_client *ib_peer_client,
> +		     struct peer_mem_attribute *attr, char *buf)
> +{
> +	return scnprintf(
> +		buf, PAGE_SIZE, "%llu\n",
> +		(u64)atomic64_read(&ib_peer_client->stats.num_dealloc_mrs));
> +}
> +
> +static PEER_ATTR_RO(num_dealloc_mrs);
> +static ssize_t num_reg_pages_show(struct ib_peer_memory_client *ib_peer_client,
> +				  struct peer_mem_attribute *attr, char *buf)
> +{
> +	return scnprintf(
> +		buf, PAGE_SIZE, "%llu\n",
> +		(u64)atomic64_read(&ib_peer_client->stats.num_reg_pages));
> +}
> +
> +static PEER_ATTR_RO(num_reg_pages);
> +static ssize_t
> +num_dereg_pages_show(struct ib_peer_memory_client *ib_peer_client,
> +		     struct peer_mem_attribute *attr, char *buf)
> +{
> +	return scnprintf(
> +		buf, PAGE_SIZE, "%llu\n",
> +		(u64)atomic64_read(&ib_peer_client->stats.num_dereg_pages));
> +}
> +
> +static PEER_ATTR_RO(num_dereg_pages);
> +static ssize_t num_reg_bytes_show(struct ib_peer_memory_client *ib_peer_client,
> +				  struct peer_mem_attribute *attr, char *buf)
> +{
> +	return scnprintf(
> +		buf, PAGE_SIZE, "%llu\n",
> +		(u64)atomic64_read(&ib_peer_client->stats.num_reg_bytes));
> +}
> +
> +static PEER_ATTR_RO(num_reg_bytes);
> +static ssize_t
> +num_dereg_bytes_show(struct ib_peer_memory_client *ib_peer_client,
> +		     struct peer_mem_attribute *attr, char *buf)
> +{
> +	return scnprintf(
> +		buf, PAGE_SIZE, "%llu\n",
> +		(u64)atomic64_read(&ib_peer_client->stats.num_dereg_bytes));
> +}
> +
> +static PEER_ATTR_RO(num_dereg_bytes);
> +static ssize_t
> +num_free_callbacks_show(struct ib_peer_memory_client *ib_peer_client,
> +			struct peer_mem_attribute *attr, char *buf)
> +{
> +	return scnprintf(buf, PAGE_SIZE, "%lu\n",
> +			 ib_peer_client->stats.num_free_callbacks);
> +}
> +
> +static PEER_ATTR_RO(num_free_callbacks);
> +static struct attribute *peer_mem_attrs[] = {
> +			&peer_attr_version.attr,
> +			&peer_attr_num_alloc_mrs.attr,
> +			&peer_attr_num_dealloc_mrs.attr,
> +			&peer_attr_num_reg_pages.attr,
> +			&peer_attr_num_dereg_pages.attr,
> +			&peer_attr_num_reg_bytes.attr,
> +			&peer_attr_num_dereg_bytes.attr,
> +			&peer_attr_num_free_callbacks.attr,
> +			NULL,
> +};
> +
> +static const struct attribute_group peer_mem_attr_group = {
> +	.attrs = peer_mem_attrs,
> +};
> +
> +static ssize_t peer_attr_show(struct kobject *kobj, struct attribute *attr,
> +			      char *buf)
> +{
> +	struct peer_mem_attribute *peer_attr =
> +		container_of(attr, struct peer_mem_attribute, attr);
> +	if (!peer_attr->show)
> +		return -EIO;
> +	return peer_attr->show(container_of(kobj, struct ib_peer_memory_client,
> +					    kobj),
> +			       peer_attr, buf);
> +}
> +
> +static const struct sysfs_ops peer_mem_sysfs_ops = {
> +	.show = peer_attr_show,
> +};
> +
> +static void ib_peer_memory_client_release(struct kobject *kobj)
> +{
> +	struct ib_peer_memory_client *ib_peer_client =
> +		container_of(kobj, struct ib_peer_memory_client, kobj);
> +	kfree(ib_peer_client);
> +}
> +
> +static struct kobj_type peer_mem_type = {
> +	.sysfs_ops = &peer_mem_sysfs_ops,
> +	.release = ib_peer_memory_client_release,
> +};
> +
> +static int ib_memory_peer_check_mandatory(const struct peer_memory_client
> +						     *peer_client)
> +{
> +#define PEER_MEM_MANDATORY_FUNC(x) {offsetof(struct peer_memory_client, x), #x}
> +	int i;
> +	static const struct {
> +		size_t offset;
> +		char *name;
> +	} mandatory_table[] = {
> +		PEER_MEM_MANDATORY_FUNC(acquire),
> +		PEER_MEM_MANDATORY_FUNC(get_pages),
> +		PEER_MEM_MANDATORY_FUNC(put_pages),
> +		PEER_MEM_MANDATORY_FUNC(dma_map),
> +		PEER_MEM_MANDATORY_FUNC(dma_unmap),
> +	};
> +	for (i = 0; i < ARRAY_SIZE(mandatory_table); ++i) {
> +		if (!*(void **)((void *)peer_client +
> +				mandatory_table[i].offset)) {
> +			pr_err("Peer memory %s is missing mandatory function %s\n",
> +			       peer_client->name, mandatory_table[i].name);
> +			return -EINVAL;
> +		}
> +	}
> +	return 0;
> +}
> +
> +void *
> +ib_register_peer_memory_client(const struct peer_memory_client *peer_client,
> +			       invalidate_peer_memory *invalidate_callback)
> +{
> +	struct ib_peer_memory_client *ib_peer_client;
> +	int ret;
> +	if (ib_memory_peer_check_mandatory(peer_client))
> +		return NULL;
> +	ib_peer_client = kzalloc(sizeof(*ib_peer_client), GFP_KERNEL);
> +	if (!ib_peer_client)
> +		return NULL;
> +	kobject_init(&ib_peer_client->kobj, &peer_mem_type);
> +	refcount_set(&ib_peer_client->usecnt, 1);
> +	init_completion(&ib_peer_client->usecnt_zero);
> +	ib_peer_client->peer_mem = peer_client;
> +	xa_init_flags(&ib_peer_client->umem_xa, XA_FLAGS_ALLOC);
> +	/*
> +	 * If the peer wants the invalidation_callback then all memory users
> +	 * linked to that peer must support invalidation.
> +	 */
> +	if (invalidate_callback) {
> +		*invalidate_callback = ib_invalidate_peer_memory;
> +		ib_peer_client->invalidation_required = true;
> +	}
> +	mutex_lock(&peer_memory_mutex);
> +	if (!peers_kobj) {
> +		/* Created under /sys/kernel/mm */
> +		peers_kobj = kobject_create_and_add("memory_peers", mm_kobj);
> +		if (!peers_kobj)
> +			goto err_unlock;
> +	}
> +	ret = kobject_add(&ib_peer_client->kobj, peers_kobj, peer_client->name);
> +	if (ret)
> +		goto err_parent;
> +	ret = sysfs_create_group(&ib_peer_client->kobj,
> +				 &peer_mem_attr_group);
> +	if (ret)
> +		goto err_parent;
> +	list_add_tail(&ib_peer_client->core_peer_list, &peer_memory_list);
> +	mutex_unlock(&peer_memory_mutex);
> +	return ib_peer_client;
> +err_parent:
> +	if (list_empty(&peer_memory_list)) {
> +		kobject_put(peers_kobj);
> +		peers_kobj = NULL;
> +	}
> +err_unlock:
> +	mutex_unlock(&peer_memory_mutex);
> +	kobject_put(&ib_peer_client->kobj);
> +	return NULL;
> +}
> +EXPORT_SYMBOL(ib_register_peer_memory_client);
> +
> +void ib_unregister_peer_memory_client(void *reg_handle)
> +{
> +	struct ib_peer_memory_client *ib_peer_client = reg_handle;
> +	mutex_lock(&peer_memory_mutex);
> +	list_del(&ib_peer_client->core_peer_list);
> +	if (list_empty(&peer_memory_list)) {
> +		kobject_put(peers_kobj);
> +		peers_kobj = NULL;
> +	}
> +	mutex_unlock(&peer_memory_mutex);
> +	/*
> +	 * Wait for all umems to be destroyed before returning. Once
> +	 * ib_unregister_peer_memory_client() returns no umems will call any
> +	 * peer_mem ops.
> +	 */
> +	if (refcount_dec_and_test(&ib_peer_client->usecnt))
> +		complete(&ib_peer_client->usecnt_zero);
> +	wait_for_completion(&ib_peer_client->usecnt_zero);
> +	kobject_put(&ib_peer_client->kobj);
> +}
> +EXPORT_SYMBOL(ib_unregister_peer_memory_client);
> +
> +static struct ib_peer_memory_client *
> +ib_get_peer_client(unsigned long addr, size_t size,
> +		   unsigned long peer_mem_flags, void **peer_client_context)
> +{
> +	struct ib_peer_memory_client *ib_peer_client;
> +	int ret = 0;
> +	mutex_lock(&peer_memory_mutex);
> +	list_for_each_entry(ib_peer_client, &peer_memory_list,
> +			    core_peer_list) {
> +		if (ib_peer_client->invalidation_required &&
> +		    (!(peer_mem_flags & IB_PEER_MEM_INVAL_SUPP)))
> +			continue;
> +		ret = ib_peer_client->peer_mem->acquire(addr, size, NULL, NULL,
> +							peer_client_context);
> +		if (ret > 0) {
> +			refcount_inc(&ib_peer_client->usecnt);
> +			mutex_unlock(&peer_memory_mutex);
> +			return ib_peer_client;
> +		}
> +	}
> +	mutex_unlock(&peer_memory_mutex);
> +	return NULL;
> +}
> +
> +static void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
> +			       void *peer_client_context)
> +{
> +	if (ib_peer_client->peer_mem->release)
> +		ib_peer_client->peer_mem->release(peer_client_context);
> +	if (refcount_dec_and_test(&ib_peer_client->usecnt))
> +		complete(&ib_peer_client->usecnt_zero);
> +}
> +
> +static void ib_peer_umem_kref_release(struct kref *kref)
> +{
> +	kfree(container_of(kref, struct ib_umem_peer, kref));
> +}
> +
> +static void ib_unmap_peer_client(struct ib_umem_peer *umem_p)
> +{
> +	struct ib_peer_memory_client *ib_peer_client = umem_p->ib_peer_client;
> +	const struct peer_memory_client *peer_mem = ib_peer_client->peer_mem;
> +	struct ib_umem *umem = &umem_p->umem;
> +
> +	lockdep_assert_held(&umem_p->mapping_lock);
> +
> +	peer_mem->dma_unmap(&umem_p->umem.sg_head, umem_p->peer_client_context,
> +			    umem_p->umem.ibdev->dma_device);
> +	peer_mem->put_pages(&umem_p->umem.sg_head, umem_p->peer_client_context);
> +	memset(&umem->sg_head, 0, sizeof(umem->sg_head));
> +
> +	atomic64_add(umem->nmap, &ib_peer_client->stats.num_dereg_pages);
> +	atomic64_add(umem->length, &ib_peer_client->stats.num_dereg_bytes);
> +	atomic64_inc(&ib_peer_client->stats.num_dealloc_mrs);
> +
> +	if (umem_p->xa_id != PEER_NO_INVALIDATION_ID)
> +		xa_store(&ib_peer_client->umem_xa, umem_p->xa_id, NULL,
> +			 GFP_KERNEL);
> +	umem_p->mapped = false;
> +}
> +
> +static int ib_invalidate_peer_memory(void *reg_handle, u64 core_context)
> +{
> +	struct ib_peer_memory_client *ib_peer_client = reg_handle;
> +	struct ib_umem_peer *umem_p;
> +
> +	/*
> +	 * The client is not required to fence against invalidation during
> +	 * put_pages() as that would deadlock when we call put_pages() here.
> +	 * Thus the core_context cannot be a umem pointer as we have no control
> +	 * over the lifetime. Since we won't change the kABI for this to add a
> +	 * proper kref, an xarray is used.
> +	 */
> +	xa_lock(&ib_peer_client->umem_xa);
> +	ib_peer_client->stats.num_free_callbacks += 1;
> +	umem_p = xa_load(&ib_peer_client->umem_xa, core_context);
> +	if (!umem_p)
> +		goto out_unlock;
> +	kref_get(&umem_p->kref);
> +	xa_unlock(&ib_peer_client->umem_xa);
> +	mutex_lock(&umem_p->mapping_lock);
> +	if (umem_p->mapped) {
> +		/*
> +		 * At this point the invalidation_func must be !NULL as the get
> +		 * flow does not unlock mapping_lock until it is set, and umems
> +		 * that do not require invalidation are not in the xarray.
> +		 */
> +		umem_p->invalidation_func(&umem_p->umem,
> +					  umem_p->invalidation_private);
> +		ib_unmap_peer_client(umem_p);
> +	}
> +	mutex_unlock(&umem_p->mapping_lock);
> +	kref_put(&umem_p->kref, ib_peer_umem_kref_release);
> +	return 0;
> +out_unlock:
> +	xa_unlock(&ib_peer_client->umem_xa);
> +	return 0;
> +}
> +
> +void ib_umem_activate_invalidation_notifier(struct ib_umem *umem,
> +					    umem_invalidate_func_t func,
> +					    void *priv)
> +{
> +	struct ib_umem_peer *umem_p =
> +		container_of(umem, struct ib_umem_peer, umem);
> +
> +	if (WARN_ON(!umem->is_peer))
> +		return;
> +	if (umem_p->xa_id == PEER_NO_INVALIDATION_ID)
> +		return;
> +
> +	umem_p->invalidation_func = func;
> +	umem_p->invalidation_private = priv;
> +	/* Pairs with the lock in ib_peer_umem_get() */
> +	mutex_unlock(&umem_p->mapping_lock);
> +
> +	/* At this point func can be called asynchronously */
> +}
> +EXPORT_SYMBOL(ib_umem_activate_invalidation_notifier);
> +
> +struct ib_umem *ib_peer_umem_get(struct ib_umem *old_umem, int old_ret,
> +				 unsigned long peer_mem_flags)
> +{
> +	struct ib_peer_memory_client *ib_peer_client;
> +	void *peer_client_context;
> +	struct ib_umem_peer *umem_p;
> +	int ret;
> +	ib_peer_client =
> +		ib_get_peer_client(old_umem->address, old_umem->length,
> +				   peer_mem_flags, &peer_client_context);
> +	if (!ib_peer_client)
> +		return ERR_PTR(old_ret);
> +	umem_p = kzalloc(sizeof(*umem_p), GFP_KERNEL);
> +	if (!umem_p) {
> +		ret = -ENOMEM;
> +		goto err_client;
> +	}
> +
> +	kref_init(&umem_p->kref);
> +	umem_p->umem = *old_umem;
> +	memset(&umem_p->umem.sg_head, 0, sizeof(umem_p->umem.sg_head));
> +	umem_p->umem.is_peer = 1;
> +	umem_p->ib_peer_client = ib_peer_client;
> +	umem_p->peer_client_context = peer_client_context;
> +	mutex_init(&umem_p->mapping_lock);
> +	umem_p->xa_id = PEER_NO_INVALIDATION_ID;
> +
> +	mutex_lock(&umem_p->mapping_lock);
> +	if (ib_peer_client->invalidation_required) {
> +		ret = xa_alloc_cyclic(&ib_peer_client->umem_xa, &umem_p->xa_id,
> +				      umem_p,
> +				      XA_LIMIT(0, PEER_NO_INVALIDATION_ID - 1),
> +				      &ib_peer_client->xa_cyclic_next,
> +				      GFP_KERNEL);
> +		if (ret < 0)
> +			goto err_umem;
> +	}
> +
> +	/*
> +	 * We always request write permissions to the pages, to force breaking
> +	 * of any CoW during the registration of the MR. For read-only MRs we
> +	 * use the "force" flag to indicate that CoW breaking is required but
> +	 * the registration should not fail if referencing read-only areas.
> +	 */
> +	ret = ib_peer_client->peer_mem->get_pages(umem_p->umem.address,
> +						  umem_p->umem.length, 1,
> +						  !umem_p->umem.writable, NULL,
> +						  peer_client_context,
> +						  umem_p->xa_id);
> +	if (ret)
> +		goto err_xa;
> +
> +	umem_p->umem.page_shift =
> +		ilog2(ib_peer_client->peer_mem->get_page_size(peer_client_context));
> +
> +	ret = ib_peer_client->peer_mem->dma_map(&umem_p->umem.sg_head,
> +						peer_client_context,
> +						umem_p->umem.ibdev->dma_device,
> +						0, &umem_p->umem.nmap);
> +	if (ret)
> +		goto err_pages;
> +
> +	umem_p->mapped = true;
> +	atomic64_add(umem_p->umem.nmap, &ib_peer_client->stats.num_reg_pages);
> +	atomic64_add(umem_p->umem.length, &ib_peer_client->stats.num_reg_bytes);
> +	atomic64_inc(&ib_peer_client->stats.num_alloc_mrs);
> +
> +	/*
> +	 * If invalidation is allowed then the caller must call
> +	 * ib_umem_activate_invalidation_notifier() or ib_peer_umem_release() to
> +	 * unlock this mutex. Activating the notifier should be done after the
> +	 * last read of sg_head, once the caller is ready for the invalidation
> +	 * function to be called.
> +	 */
> +	if (umem_p->xa_id == PEER_NO_INVALIDATION_ID)
> +		mutex_unlock(&umem_p->mapping_lock);
> +	/*
> +	 * On success the old umem is replaced with the new, larger, allocation
> +	 */
> +	kfree(old_umem);
> +	return &umem_p->umem;
> +err_pages:
> +	ib_peer_client->peer_mem->put_pages(&umem_p->umem.sg_head,
> +					    umem_p->peer_client_context);
> +err_xa:
> +	if (umem_p->xa_id != PEER_NO_INVALIDATION_ID)
> +		xa_erase(&umem_p->ib_peer_client->umem_xa, umem_p->xa_id);
> +err_umem:
> +	mutex_unlock(&umem_p->mapping_lock);
> +	kref_put(&umem_p->kref, ib_peer_umem_kref_release);
> +err_client:
> +	ib_put_peer_client(ib_peer_client, peer_client_context);
> +	return ERR_PTR(ret);
> +}
> +
> +void ib_peer_umem_release(struct ib_umem *umem)
> +{
> +	struct ib_umem_peer *umem_p =
> +		container_of(umem, struct ib_umem_peer, umem);
> +
> +	/* invalidation_func being set indicates activate was called */
> +	if (umem_p->xa_id == PEER_NO_INVALIDATION_ID ||
> +	    umem_p->invalidation_func)
> +		mutex_lock(&umem_p->mapping_lock);
> +
> +	if (umem_p->mapped)
> +		ib_unmap_peer_client(umem_p);
> +	mutex_unlock(&umem_p->mapping_lock);
> +
> +	if (umem_p->xa_id != PEER_NO_INVALIDATION_ID)
> +		xa_erase(&umem_p->ib_peer_client->umem_xa, umem_p->xa_id);
> +	ib_put_peer_client(umem_p->ib_peer_client, umem_p->peer_client_context);
> +	umem_p->ib_peer_client = NULL;
> +
> +	/* Must match ib_umem_release() */
> +	atomic64_sub(ib_umem_num_pages(umem), &umem->owning_mm->pinned_vm);
> +	mmdrop(umem->owning_mm);
> +
> +	kref_put(&umem_p->kref, ib_peer_umem_kref_release);
> +}
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 698c5359f643..e7473285e470 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -42,6 +42,7 @@
>   #include <rdma/ib_umem_odp.h>
>   
>   #include "uverbs.h"
> +#include "ib_peer_mem.h"
>   
>   static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
>   {
> @@ -193,15 +194,17 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
>   EXPORT_SYMBOL(ib_umem_find_best_pgsz);
>   
>   /**
> - * ib_umem_get - Pin and DMA map userspace memory.
> + * __ib_umem_get - Pin and DMA map userspace memory.
>    *
>    * @device: IB device to connect UMEM
>    * @addr: userspace virtual address to start at
>    * @size: length of region to pin
>    * @access: IB_ACCESS_xxx flags for memory being pinned
> + * @peer_mem_flags: IB_PEER_MEM_xxx flags for memory being used
>    */
> -struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
> -			    size_t size, int access)
> +struct ib_umem *__ib_umem_get(struct ib_device *device,
> +			      unsigned long addr, size_t size, int access,
> +			      unsigned long peer_mem_flags)
>   {
>   	struct ib_umem *umem;
>   	struct page **page_list;
> @@ -309,6 +312,24 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
>   
>   umem_release:
>   	__ib_umem_release(device, umem, 0);
> +	/*
> +	 * If the address belongs to peer memory client, then the first
> +	 * call to get_user_pages will fail. In this case, try to get
> +	 * these pages from the peers.
> +	 */
> +	//FIXME: this placement is horrible
> +	if (ret < 0 && peer_mem_flags & IB_PEER_MEM_ALLOW) {
> +		struct ib_umem *new_umem;
> +
> +		new_umem = ib_peer_umem_get(umem, ret, peer_mem_flags);
> +		if (IS_ERR(new_umem)) {
> +			ret = PTR_ERR(new_umem);
> +			goto vma;
> +		}
> +		umem = new_umem;
> +		ret = 0;
> +		goto out;
> +	}
>   vma:
>   	atomic64_sub(ib_umem_num_pages(umem), &mm->pinned_vm);
>   out:
> @@ -320,8 +341,23 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
>   	}
>   	return ret ? ERR_PTR(ret) : umem;
>   }
> +
> +struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
> +			    size_t size, int access)
> +{
> +	return __ib_umem_get(device, addr, size, access, 0);
> +}
>   EXPORT_SYMBOL(ib_umem_get);
>   
> +struct ib_umem *ib_umem_get_peer(struct ib_device *device, unsigned long addr,
> +				 size_t size, int access,
> +				 unsigned long peer_mem_flags)
> +{
> +	return __ib_umem_get(device, addr, size, access,
> +			     IB_PEER_MEM_ALLOW | peer_mem_flags);
> +}
> +EXPORT_SYMBOL(ib_umem_get_peer);
> +
>   /**
>    * ib_umem_release - release memory pinned with ib_umem_get
>    * @umem: umem struct to release
> @@ -333,6 +369,8 @@ void ib_umem_release(struct ib_umem *umem)
>   	if (umem->is_odp)
>   		return ib_umem_odp_release(to_ib_umem_odp(umem));
>   
> +	if (umem->is_peer)
> +		return ib_peer_umem_release(umem);
>   	__ib_umem_release(umem->ibdev, umem, 1);
>   
>   	atomic64_sub(ib_umem_num_pages(umem), &umem->owning_mm->pinned_vm);
> diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
> index 2f5ee37c252b..cd2241bb865a 100644
> --- a/drivers/infiniband/hw/mlx5/cq.c
> +++ b/drivers/infiniband/hw/mlx5/cq.c
> @@ -733,8 +733,9 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct ib_udata *udata,
>   	*cqe_size = ucmd.cqe_size;
>   
>   	cq->buf.umem =
> -		ib_umem_get(&dev->ib_dev, ucmd.buf_addr,
> -			    entries * ucmd.cqe_size, IB_ACCESS_LOCAL_WRITE);
> +		ib_umem_get_peer(&dev->ib_dev, ucmd.buf_addr,
> +				 entries * ucmd.cqe_size,
> +				 IB_ACCESS_LOCAL_WRITE, 0);
>   	if (IS_ERR(cq->buf.umem)) {
>   		err = PTR_ERR(cq->buf.umem);
>   		return err;
> @@ -1132,9 +1133,9 @@ static int resize_user(struct mlx5_ib_dev *dev, struct mlx5_ib_cq *cq,
>   	if (ucmd.cqe_size && SIZE_MAX / ucmd.cqe_size <= entries - 1)
>   		return -EINVAL;
>   
> -	umem = ib_umem_get(&dev->ib_dev, ucmd.buf_addr,
> -			   (size_t)ucmd.cqe_size * entries,
> -			   IB_ACCESS_LOCAL_WRITE);
> +	umem = ib_umem_get_peer(&dev->ib_dev, ucmd.buf_addr,
> +				(size_t)ucmd.cqe_size * entries,
> +				IB_ACCESS_LOCAL_WRITE, 0);
>   	if (IS_ERR(umem)) {
>   		err = PTR_ERR(umem);
>   		return err;
> diff --git a/drivers/infiniband/hw/mlx5/devx.c b/drivers/infiniband/hw/mlx5/devx.c
> index c3b4b6586d17..f8f8507c7938 100644
> --- a/drivers/infiniband/hw/mlx5/devx.c
> +++ b/drivers/infiniband/hw/mlx5/devx.c
> @@ -2143,7 +2143,7 @@ static int devx_umem_get(struct mlx5_ib_dev *dev, struct ib_ucontext *ucontext,
>   	if (err)
>   		return err;
>   
> -	obj->umem = ib_umem_get(&dev->ib_dev, addr, size, access);
> +	obj->umem = ib_umem_get_peer(&dev->ib_dev, addr, size, access, 0);
>   	if (IS_ERR(obj->umem))
>   		return PTR_ERR(obj->umem);
>   
> diff --git a/drivers/infiniband/hw/mlx5/doorbell.c b/drivers/infiniband/hw/mlx5/doorbell.c
> index 61475b571531..a2a7e121ee5f 100644
> --- a/drivers/infiniband/hw/mlx5/doorbell.c
> +++ b/drivers/infiniband/hw/mlx5/doorbell.c
> @@ -64,8 +64,8 @@ int mlx5_ib_db_map_user(struct mlx5_ib_ucontext *context,
>   
>   	page->user_virt = (virt & PAGE_MASK);
>   	page->refcnt    = 0;
> -	page->umem = ib_umem_get(context->ibucontext.device, virt & PAGE_MASK,
> -				 PAGE_SIZE, 0);
> +	page->umem = ib_umem_get_peer(context->ibucontext.device, virt & PAGE_MASK,
> +				      PAGE_SIZE, 0, 0);
>   	if (IS_ERR(page->umem)) {
>   		err = PTR_ERR(page->umem);
>   		kfree(page);
> diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
> index b5aece786b36..174567af5ddd 100644
> --- a/drivers/infiniband/hw/mlx5/mem.c
> +++ b/drivers/infiniband/hw/mlx5/mem.c
> @@ -55,16 +55,17 @@ void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr,
>   	int i = 0;
>   	struct scatterlist *sg;
>   	int entry;
> +	int page_shift = umem->is_peer ? umem->page_shift : PAGE_SHIFT;
>   
> -	addr = addr >> PAGE_SHIFT;
> +	addr = addr >> page_shift;
>   	tmp = (unsigned long)addr;
>   	m = find_first_bit(&tmp, BITS_PER_LONG);
>   	if (max_page_shift)
> -		m = min_t(unsigned long, max_page_shift - PAGE_SHIFT, m);
> +		m = min_t(unsigned long, max_page_shift - page_shift, m);
>   
>   	for_each_sg(umem->sg_head.sgl, sg, umem->nmap, entry) {
> -		len = sg_dma_len(sg) >> PAGE_SHIFT;
> -		pfn = sg_dma_address(sg) >> PAGE_SHIFT;
> +		len = sg_dma_len(sg) >> page_shift;
> +		pfn = sg_dma_address(sg) >> page_shift;
>   		if (base + p != pfn) {
>   			/* If either the offset or the new
>   			 * base are unaligned update m
> @@ -96,7 +97,7 @@ void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr,
>   
>   		*ncont = 0;
>   	}
> -	*shift = PAGE_SHIFT + m;
> +	*shift = page_shift + m;
>   	*count = i;
>   }
>   
> diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
> index 24daf420317e..2d075ca40bfc 100644
> --- a/drivers/infiniband/hw/mlx5/mr.c
> +++ b/drivers/infiniband/hw/mlx5/mr.c
> @@ -41,6 +41,8 @@
>   #include <rdma/ib_verbs.h>
>   #include "mlx5_ib.h"
>   
> +static void mlx5_invalidate_umem(struct ib_umem *umem, void *priv);
> +
>   enum {
>   	MAX_PENDING_REG_MR = 8,
>   };
> @@ -754,7 +756,7 @@ static int mr_cache_max_order(struct mlx5_ib_dev *dev)
>   
>   static int mr_umem_get(struct mlx5_ib_dev *dev, u64 start, u64 length,
>   		       int access_flags, struct ib_umem **umem, int *npages,
> -		       int *page_shift, int *ncont, int *order)
> +		       int *page_shift, int *ncont, int *order, bool allow_peer)
>   {
>   	struct ib_umem *u;
>   
> @@ -779,7 +781,13 @@ static int mr_umem_get(struct mlx5_ib_dev *dev, u64 start, u64 length,
>   		if (order)
>   			*order = ilog2(roundup_pow_of_two(*ncont));
>   	} else {
> -		u = ib_umem_get(&dev->ib_dev, start, length, access_flags);
> +		if (allow_peer)
> +			u = ib_umem_get_peer(&dev->ib_dev, start, length,
> +					     access_flags,
> +					     IB_PEER_MEM_INVAL_SUPP);
> +		else
> +			u = ib_umem_get(&dev->ib_dev, start, length,
> +					access_flags);
>   		if (IS_ERR(u)) {
>   			mlx5_ib_dbg(dev, "umem get failed (%ld)\n", PTR_ERR(u));
>   			return PTR_ERR(u);
> @@ -1280,7 +1288,7 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
>   	}
>   
>   	err = mr_umem_get(dev, start, length, access_flags, &umem,
> -			  &npages, &page_shift, &ncont, &order);
> +			  &npages, &page_shift, &ncont, &order, true);
>   
>   	if (err < 0)
>   		return ERR_PTR(err);
> @@ -1335,6 +1343,12 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
>   		}
>   	}
>   
> +	if (umem->is_peer) {
> +		ib_umem_activate_invalidation_notifier(
> +			umem, mlx5_invalidate_umem, mr);
> +		/* After this point the MR can be invalidated */
> +	}
> +
>   	if (is_odp_mr(mr)) {
>   		to_ib_umem_odp(mr->umem)->private = mr;
>   		atomic_set(&mr->num_pending_prefetch, 0);
> @@ -1412,6 +1426,10 @@ int mlx5_ib_rereg_user_mr(struct ib_mr *ib_mr, int flags, u64 start,
>   
>   	atomic_sub(mr->npages, &dev->mdev->priv.reg_pages);
>   
> +	/* Peer memory isn't supported */
> +	if (mr->umem->is_peer)
> +		return -EOPNOTSUPP;
> +
>   	if (!mr->umem)
>   		return -EINVAL;
>   
> @@ -1435,7 +1453,7 @@ int mlx5_ib_rereg_user_mr(struct ib_mr *ib_mr, int flags, u64 start,
>   		ib_umem_release(mr->umem);
>   		mr->umem = NULL;
>   		err = mr_umem_get(dev, addr, len, access_flags, &mr->umem,
> -				  &npages, &page_shift, &ncont, &order);
> +				  &npages, &page_shift, &ncont, &order, false);
>   		if (err)
>   			goto err;
>   	}
> @@ -1615,13 +1633,14 @@ static void dereg_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr)
>   	 * We should unregister the DMA address from the HCA before
>   	 * remove the DMA mapping.
>   	 */
> -	mlx5_mr_cache_free(dev, mr);
> +	if (mr->allocated_from_cache)
> +		mlx5_mr_cache_free(dev, mr);
> +	else
> +		kfree(mr);
> +
>   	ib_umem_release(umem);
>   	if (umem)
>   		atomic_sub(npages, &dev->mdev->priv.reg_pages);
> -
> -	if (!mr->allocated_from_cache)
> -		kfree(mr);
>   }
>   
>   int mlx5_ib_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
> @@ -2331,3 +2350,15 @@ int mlx5_ib_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
>   
>   	return n;
>   }
> +
> +static void mlx5_invalidate_umem(struct ib_umem *umem, void *priv)
> +{
> +	struct mlx5_ib_mr *mr = priv;
> +
> +	/*
> +	 * DMA is turned off for the mkey, but the mkey remains otherwise
> +	 * untouched until the normal flow of dereg_mr happens. Any access to
> +	 * this mkey will generate CQEs.
> +	 */
> +	unreg_umr(mr->dev, mr);
> +}
> diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
> index 45faab9e1313..be59c6d5ba1c 100644
> --- a/drivers/infiniband/hw/mlx5/qp.c
> +++ b/drivers/infiniband/hw/mlx5/qp.c
> @@ -749,7 +749,7 @@ static int mlx5_ib_umem_get(struct mlx5_ib_dev *dev, struct ib_udata *udata,
>   {
>   	int err;
>   
> -	*umem = ib_umem_get(&dev->ib_dev, addr, size, 0);
> +	*umem = ib_umem_get_peer(&dev->ib_dev, addr, size, 0, 0);
>   	if (IS_ERR(*umem)) {
>   		mlx5_ib_dbg(dev, "umem_get failed\n");
>   		return PTR_ERR(*umem);
> diff --git a/drivers/infiniband/hw/mlx5/srq.c b/drivers/infiniband/hw/mlx5/srq.c
> index 6d1ff13d2283..2f55f7e1923d 100644
> --- a/drivers/infiniband/hw/mlx5/srq.c
> +++ b/drivers/infiniband/hw/mlx5/srq.c
> @@ -80,7 +80,7 @@ static int create_srq_user(struct ib_pd *pd, struct mlx5_ib_srq *srq,
>   
>   	srq->wq_sig = !!(ucmd.flags & MLX5_SRQ_FLAG_SIGNATURE);
>   
> -	srq->umem = ib_umem_get(pd->device, ucmd.buf_addr, buf_size, 0);
> +	srq->umem = ib_umem_get_peer(pd->device, ucmd.buf_addr, buf_size, 0, 0);
>   	if (IS_ERR(srq->umem)) {
>   		mlx5_ib_dbg(dev, "failed umem get, size %d\n", buf_size);
>   		err = PTR_ERR(srq->umem);
> diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> index 9353910915d4..ec9824cbf49d 100644
> --- a/include/rdma/ib_umem.h
> +++ b/include/rdma/ib_umem.h
> @@ -48,10 +48,19 @@ struct ib_umem {
>   	unsigned long		address;
>   	u32 writable : 1;
>   	u32 is_odp : 1;
> +	/* Placing at the end of the bitfield list is ABI preserving on LE */
> +	u32 is_peer : 1;
>   	struct work_struct	work;
>   	struct sg_table sg_head;
>   	int             nmap;
>   	unsigned int    sg_nents;
> +	unsigned int    page_shift;
> +};
> +
> +typedef void (*umem_invalidate_func_t)(struct ib_umem *umem, void *priv);
> +enum ib_peer_mem_flags {
> +	IB_PEER_MEM_ALLOW = 1 << 0,
> +	IB_PEER_MEM_INVAL_SUPP = 1 << 1,
>   };
>   
>   /* Returns the offset of the umem start relative to the first page. */
> @@ -79,6 +88,13 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
>   				     unsigned long pgsz_bitmap,
>   				     unsigned long virt);
>   
> +struct ib_umem *ib_umem_get_peer(struct ib_device *device, unsigned long addr,
> +				 size_t size, int access,
> +				 unsigned long peer_mem_flags);
> +void ib_umem_activate_invalidation_notifier(struct ib_umem *umem,
> +					    umem_invalidate_func_t func,
> +					    void *cookie);
> +
>   #else /* CONFIG_INFINIBAND_USER_MEM */
>   
>   #include <linux/err.h>
> @@ -102,6 +118,19 @@ static inline unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
>   	return 0;
>   }
>   
> +static inline struct ib_umem *ib_umem_get_peer(struct ib_device *device,
> +					       unsigned long addr, size_t size,
> +					       int access,
> +					       unsigned long peer_mem_flags)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> +
> +static inline void ib_umem_activate_invalidation_notifier(
> +	struct ib_umem *umem, umem_invalidate_func_t func, void *cookie)
> +{
> +}
> +
>   #endif /* CONFIG_INFINIBAND_USER_MEM */
>   
>   #endif /* IB_UMEM_H */
> diff --git a/include/rdma/peer_mem.h b/include/rdma/peer_mem.h
> new file mode 100644
> index 000000000000..563a820dbc32
> --- /dev/null
> +++ b/include/rdma/peer_mem.h
> @@ -0,0 +1,165 @@
> +/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
> +/*
> + * Copyright (c) 2014-2020,  Mellanox Technologies. All rights reserved.
> + */
> +#ifndef RDMA_PEER_MEM_H
> +#define RDMA_PEER_MEM_H
> +
> +#include <linux/scatterlist.h>
> +
> +#define IB_PEER_MEMORY_NAME_MAX 64
> +#define IB_PEER_MEMORY_VER_MAX 16
> +
> +/*
> + * Prior versions used a void * for core_context; at some point this was
> + * switched to use u64. Be careful if compiling this as 32 bit. To help, the
> + * value of core_context is limited to u32, so it should work OK despite the
> + * type change.
> + */
> +#define PEER_MEM_U64_CORE_CONTEXT
> +
> +struct device;
> +
> +/**
> + *  struct peer_memory_client - registration information for user virtual
> + *                              memory handlers
> + *
> + * The peer_memory_client scheme allows a driver to register with the ib_umem
> + * system that it has the ability to understand user virtual address ranges
> + * that are not compatible with get_user_pages(). For instance, VMAs created
> + * with io_remap_pfn_range(), or other driver-special VMAs.
> + *
> + * For ranges the interface understands it can provide a DMA mapped sg_table
> + * for use by the ib_umem, allowing user virtual ranges that cannot be
> + * supported by get_user_pages() to be used as umems.
> + */
> +struct peer_memory_client {
> +	char name[IB_PEER_MEMORY_NAME_MAX];
> +	char version[IB_PEER_MEMORY_VER_MAX];
> +
> +	/**
> +	 * acquire - Begin working with a user space virtual address range
> +	 *
> +	 * @addr - Virtual address to be checked whether it belongs to the peer.
> +	 * @size - Length of the virtual memory area starting at addr.
> +	 * @peer_mem_private_data - Obsolete, always NULL
> +	 * @peer_mem_name - Obsolete, always NULL
> +	 * @client_context - Returns an opaque value for this acquire, for use
> +	 *                   in other APIs
> +	 *
> +	 * Returns 1 if the peer_memory_client supports the entire virtual
> +	 * address range, 0 or -ERRNO otherwise. If 1 is returned then
> +	 * release() will be called to release the acquire().
> +	 */
> +	int (*acquire)(unsigned long addr, size_t size,
> +		       void *peer_mem_private_data, char *peer_mem_name,
> +		       void **client_context);
> +	/**
> +	 * get_pages - Fill in the first part of a sg_table for a virtual
> +	 *             address range
> +	 *
> +	 * @addr - Virtual address to be checked whether it belongs to the peer.
> +	 * @size - Length of the virtual memory area starting at addr.
> +	 * @write - Always 1
> +	 * @force - 1 if write is required
> +	 * @sg_head - Obsolete, always NULL
> +	 * @client_context - Value returned by acquire()
> +	 * @core_context - Value to be passed to invalidate_peer_memory for
> +	 *                 this get
> +	 *
> +	 * addr/size are passed as the raw virtual address range requested by
> +	 * the user; it is not aligned to any page size. get_pages() is always
> +	 * followed by dma_map().
> +	 *
> +	 * Upon return the caller can call the invalidate_callback().
> +	 *
> +	 * Returns 0 on success, -ERRNO on failure. After success put_pages()
> +	 * will be called to return the pages.
> +	 */
> +	int (*get_pages)(unsigned long addr, size_t size, int write, int force,
> +			 struct sg_table *sg_head, void *client_context,
> +			 u64 core_context);
> +	/**
> +	 * dma_map - Create a DMA mapped sg_table
> +	 *
> +	 * @sg_head - The sg_table to allocate
> +	 * @client_context - Value returned by acquire()
> +	 * @dma_device - The device that will be doing DMA from these addresses
> +	 * @dmasync - Obsolete, always 0
> +	 * @nmap - Returns the number of dma mapped entries in the sg_head
> +	 *
> +	 * Must be called after get_pages(). This must fill in the sg_head with
> +	 * DMA mapped SGLs for dma_device. Each SGL start and end must meet a
> +	 * minimum alignment of at least PAGE_SIZE, though individual sgls can
> +	 * be multiples of PAGE_SIZE, in any mixture. Since the user virtual
> +	 * address/size are not page aligned, the implementation must increase
> +	 * it to the logical alignment when building the SGLs.
> +	 *
> +	 * Returns 0 on success, -ERRNO on failure. After success dma_unmap()
> +	 * will be called to unmap the pages. On failure sg_head must be left
> +	 * untouched or point to a valid sg_table.
> +	 */
> +	int (*dma_map)(struct sg_table *sg_head, void *client_context,
> +		       struct device *dma_device, int dmasync, int *nmap);
> +	/**
> +	 * dma_unmap - Unmap a DMA mapped sg_table
> +	 *
> +	 * @sg_head - The sg_table to unmap
> +	 * @client_context - Value returned by acquire()
> +	 * @dma_device - The device that will be doing DMA from these addresses
> +	 *
> +	 * sg_head will not be touched after this function returns.
> +	 *
> +	 * Must return 0.
> +	 */
> +	int (*dma_unmap)(struct sg_table *sg_head, void *client_context,
> +			 struct device *dma_device);
> +	/**
> +	 * put_pages - Unpin a SGL
> +	 *
> +	 * @sg_head - The sg_table to unpin
> +	 * @client_context - Value returned by acquire()
> +	 *
> +	 * sg_head must be freed on return.
> +	 */
> +	void (*put_pages)(struct sg_table *sg_head, void *client_context);
> +	/* Obsolete, not used */
> +	unsigned long (*get_page_size)(void *client_context);
> +	/**
> +	 * release - Undo acquire
> +	 *
> +	 * @client_context - Value returned by acquire()
> +	 *
> +	 * If acquire() returns 1 then release() must be called. All
> +	 * get_pages() and dma_map()'s must be undone before calling this
> +	 * function.
> +	 */
> +	void (*release)(void *client_context);
> +};
> +
> +/*
> + * If invalidate_callback() is non-NULL then the client will only support
> + * umems which can be invalidated. The caller may call the
> + * invalidate_callback() after acquire(); on return the range will no longer
> + * have DMA active, and release() will have been called.
> + *
> + * Note: The implementation's locking must ensure that get_pages() and
> + * dma_map() do not have locking dependencies with invalidate_callback(). The
> + * ib_core will wait until any concurrent get_pages() or dma_map() completes
> + * before returning.
> + *
> + * Similarly, the ib_core can call dma_unmap(), put_pages() and release()
> + * from within the callback, or will wait for another thread doing those
> + * operations to complete.
> + *
> + * For these reasons the user of invalidate_callback() must be careful with
> + * locking.
> + */
> +typedef int (*invalidate_peer_memory)(void *reg_handle, u64 core_context);
> +
> +void *
> +ib_register_peer_memory_client(const struct peer_memory_client *peer_client,
> +			       invalidate_peer_memory *invalidate_callback);
> +void ib_unregister_peer_memory_client(void *reg_handle);
> +
> +#endif
> 

