[RFC][Focal][PATCH 2/2] RDMA/core: Introduce peer memory interface
dann frazier
dann.frazier at canonical.com
Tue Aug 31 16:59:13 UTC 2021
On Tue, Aug 31, 2021 at 02:34:37PM +0200, Stefan Bader wrote:
> On 31.08.21 00:50, dann frazier wrote:
> > From: Feras Daoud <ferasda at mellanox.com>
> >
> > BugLink: https://bugs.launchpad.net/bugs/1923104
> >
> > The peer_memory_client scheme allows a driver to register with the ib_umem
> > system that it has the ability to understand user virtual address ranges
> > that are not compatible with get_user_pages(). For instance, VMAs created
> > with io_remap_pfn_range(), or other driver-specific special VMAs.
> >
> > For ranges the interface understands it can provide a DMA mapped sg_table
> > for use by the ib_umem, allowing user virtual ranges that cannot be
> > supported by get_user_pages() to be used as umems for RDMA.
> >
> > This is designed to preserve the kABI, no functions or structures are
> > changed, only new symbols are added:
> >
> > ib_register_peer_memory_client
> > ib_unregister_peer_memory_client
> > ib_umem_activate_invalidation_notifier
> > ib_umem_get_peer
> >
> > And a bitfield in struct ib_umem uses more bits.
> >
> > This interface is compatible with the two out of tree GPU drivers:
> > https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/blob/master/drivers/gpu/drm/amd/amdkfd/kfd_peerdirect.c
> > https://github.com/Mellanox/nv_peer_memory/blob/master/nv_peer_mem.c
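> >
> > For illustration only (not part of this patch), a peer driver is expected
> > to hook in roughly as sketched below. This is an untested, minimal outline;
> > the my_* names are hypothetical placeholders, only the peer_mem.h symbols
> > are real:
> >
> > #include <linux/module.h>
> > #include <rdma/peer_mem.h>
> >
> > static void *reg_handle;
> > /* Filled in by the core with its invalidation entry point */
> > static invalidate_peer_memory mem_invalidate_cb;
> >
> > /* Return 1 if [addr, addr + size) belongs to this device (stub: never) */
> > static int my_acquire(unsigned long addr, size_t size,
> >                       void *peer_mem_private_data, char *peer_mem_name,
> >                       void **client_context)
> > {
> >         return 0;
> > }
> >
> > static int my_get_pages(unsigned long addr, size_t size, int write,
> >                         int force, struct sg_table *sg_head,
> >                         void *client_context, u64 core_context)
> > {
> >         return -ENODEV;         /* stub */
> > }
> >
> > static int my_dma_map(struct sg_table *sg_head, void *client_context,
> >                       struct device *dma_device, int dmasync, int *nmap)
> > {
> >         return -ENODEV;         /* stub */
> > }
> >
> > static int my_dma_unmap(struct sg_table *sg_head, void *client_context,
> >                         struct device *dma_device)
> > {
> >         return 0;
> > }
> >
> > static void my_put_pages(struct sg_table *sg_head, void *client_context)
> > {
> > }
> >
> > static unsigned long my_get_page_size(void *client_context)
> > {
> >         return PAGE_SIZE;
> > }
> >
> > static void my_release(void *client_context)
> > {
> > }
> >
> > static const struct peer_memory_client my_client = {
> >         .name          = "my_peer_mem",
> >         .version       = "1.0",
> >         .acquire       = my_acquire,
> >         .get_pages     = my_get_pages,
> >         .dma_map       = my_dma_map,
> >         .dma_unmap     = my_dma_unmap,
> >         .put_pages     = my_put_pages,
> >         .get_page_size = my_get_page_size,
> >         .release       = my_release,
> > };
> >
> > static int __init my_init(void)
> > {
> >         /* Passing &mem_invalidate_cb means all umems must support invalidation */
> >         reg_handle = ib_register_peer_memory_client(&my_client,
> >                                                     &mem_invalidate_cb);
> >         return reg_handle ? 0 : -EINVAL;
> > }
> >
> > static void __exit my_exit(void)
> > {
> >         if (reg_handle)
> >                 ib_unregister_peer_memory_client(reg_handle);
> > }
> >
> > module_init(my_init);
> > module_exit(my_exit);
> > MODULE_LICENSE("GPL");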
> >
> > NOTES (remove before sending):
> > - The exact locking semantics from the GPU side during invalidation
> > are confusing. I've made it sane but perhaps this will hit locking
> > problems. Test with lockdep and test invalidation.
> >
> > The main difference here is that get_pages and dma_map are called
> > from a context that will block progress of invalidation.
> >
> > The old design blocked progress of invalidation using a completion for
> > unmap and unpin, so those should be proven safe now.
> >
> > Since the old design used a completion it doesn't work with lockdep,
> > even though it has basically the same blocking semantics.
> >
> > - The API exported to the GPU side is crufty and makes very little
> > sense. Functionally it should still be the same, but many useless
> > things were dropped.
> >
> > - I rewrote all the comments; please check spelling/grammar.
> >
> > - Compile tested only
> >
> > Issue: 2189651
> > Change-Id: I1d77f52d56aec2c79e6b9d9ec1096e83a95155cd
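> >
> > On the consumer side, the mlx5 hunks below boil down to the following
> > pattern (hypothetical sketch; the my_* names are placeholders, not part
> > of this patch):
> >
> > #include <linux/err.h>
> > #include <rdma/ib_umem.h>
> >
> > struct my_mr {                          /* hypothetical MR bookkeeping */
> >         struct ib_umem *umem;
> > };
> >
> > static void my_unreg_hw_mapping(struct my_mr *mr)
> > {
> >         /* hypothetical: revoke the HW mkey so no further DMA occurs */
> > }
> >
> > static void my_invalidate(struct ib_umem *umem, void *priv)
> > {
> >         struct my_mr *mr = priv;
> >
> >         /* DMA against this umem must be stopped before returning */
> >         my_unreg_hw_mapping(mr);
> > }
> >
> > static int my_reg_user_mr(struct ib_device *dev, u64 start, u64 length,
> >                           int access, struct my_mr *mr)
> > {
> >         struct ib_umem *umem;
> >
> >         umem = ib_umem_get_peer(dev, start, length, access,
> >                                 IB_PEER_MEM_INVAL_SUPP);
> >         if (IS_ERR(umem))
> >                 return PTR_ERR(umem);
> >
> >         mr->umem = umem;
> >         /* ... program the HW from umem->sg_head ... */
> >
> >         if (umem->is_peer)
> >                 /* From here on my_invalidate() may run asynchronously */
> >                 ib_umem_activate_invalidation_notifier(umem, my_invalidate,
> >                                                        mr);
> >         return 0;
> > }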
>
> I am assuming this is some import from some vendor tree (and as such would
> be a SAUCE patch).
Yes, I will correct that.
> But this also brings up the question of upstream status. If
> it is upstream, but as a sequence of patches, then those are rather what
> an LTS kernel should backport. If this is still not upstream, it will remain
> a liability for all upcoming releases until it is, either as a SAUCE patch
> that has to be adapted while carrying it forward or as something that
> potentially gets forgotten.
It is not upstream and, AIUI, never will be in this form. I believe the
long-term plan is to migrate users to dma-buf/p2pdma upstream when
possible.
> And even if it is a SAUCE patch in Impish already,
FYI, it is SAUCE in Impish and Hirsute today
> there is always a chance that upstream changes in ways that cause us
> more and more pain.
Yep, understood. Note that we have agreement from the developers to
assist with porting if/when necessary.
-dann
> -Stefan
>
> > Signed-off-by: Yishai Hadas <yishaih at mellanox.com>
> > Signed-off-by: Feras Daoud <ferasda at mellanox.com>
> > Signed-off-by: Jason Gunthorpe <jgg at mellanox.com>
> > Signed-off-by: dann frazier <dann.frazier at canonical.com>
> > ---
> > drivers/infiniband/core/Makefile | 2 +-
> > drivers/infiniband/core/ib_peer_mem.h | 52 +++
> > drivers/infiniband/core/peer_mem.c | 484 ++++++++++++++++++++++++++
> > drivers/infiniband/core/umem.c | 44 ++-
> > drivers/infiniband/hw/mlx5/cq.c | 11 +-
> > drivers/infiniband/hw/mlx5/devx.c | 2 +-
> > drivers/infiniband/hw/mlx5/doorbell.c | 4 +-
> > drivers/infiniband/hw/mlx5/mem.c | 11 +-
> > drivers/infiniband/hw/mlx5/mr.c | 47 ++-
> > drivers/infiniband/hw/mlx5/qp.c | 2 +-
> > drivers/infiniband/hw/mlx5/srq.c | 2 +-
> > include/rdma/ib_umem.h | 29 ++
> > include/rdma/peer_mem.h | 165 +++++++++
> > 13 files changed, 828 insertions(+), 27 deletions(-)
> > create mode 100644 drivers/infiniband/core/ib_peer_mem.h
> > create mode 100644 drivers/infiniband/core/peer_mem.c
> > create mode 100644 include/rdma/peer_mem.h
> >
> > diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
> > index 9a8871e21545..4b7838ff6e90 100644
> > --- a/drivers/infiniband/core/Makefile
> > +++ b/drivers/infiniband/core/Makefile
> > @@ -34,5 +34,5 @@ ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_marshall.o \
> > uverbs_std_types_flow_action.o uverbs_std_types_dm.o \
> > uverbs_std_types_mr.o uverbs_std_types_counters.o \
> > uverbs_uapi.o uverbs_std_types_device.o
> > -ib_uverbs-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
> > +ib_uverbs-$(CONFIG_INFINIBAND_USER_MEM) += umem.o peer_mem.o
> > ib_uverbs-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o
> > diff --git a/drivers/infiniband/core/ib_peer_mem.h b/drivers/infiniband/core/ib_peer_mem.h
> > new file mode 100644
> > index 000000000000..bb38ffee724a
> > --- /dev/null
> > +++ b/drivers/infiniband/core/ib_peer_mem.h
> > @@ -0,0 +1,52 @@
> > +/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
> > +/*
> > + * Copyright (c) 2014-2020, Mellanox Technologies. All rights reserved.
> > + */
> > +#ifndef RDMA_IB_PEER_MEM_H
> > +#define RDMA_IB_PEER_MEM_H
> > +
> > +#include <rdma/peer_mem.h>
> > +#include <linux/kobject.h>
> > +#include <linux/xarray.h>
> > +#include <rdma/ib_umem.h>
> > +
> > +struct ib_peer_memory_statistics {
> > + atomic64_t num_alloc_mrs;
> > + atomic64_t num_dealloc_mrs;
> > + atomic64_t num_reg_pages;
> > + atomic64_t num_dereg_pages;
> > + atomic64_t num_reg_bytes;
> > + atomic64_t num_dereg_bytes;
> > + unsigned long num_free_callbacks;
> > +};
> > +
> > +struct ib_peer_memory_client {
> > + struct kobject kobj;
> > + refcount_t usecnt;
> > + struct completion usecnt_zero;
> > + const struct peer_memory_client *peer_mem;
> > + struct list_head core_peer_list;
> > + struct ib_peer_memory_statistics stats;
> > + struct xarray umem_xa;
> > + u32 xa_cyclic_next;
> > + bool invalidation_required;
> > +};
> > +
> > +struct ib_umem_peer {
> > + struct ib_umem umem;
> > + struct kref kref;
> > + /* peer memory that manages this umem */
> > + struct ib_peer_memory_client *ib_peer_client;
> > + void *peer_client_context;
> > + umem_invalidate_func_t invalidation_func;
> > + void *invalidation_private;
> > + struct mutex mapping_lock;
> > + bool mapped;
> > + u32 xa_id;
> > +};
> > +
> > +struct ib_umem *ib_peer_umem_get(struct ib_umem *old_umem, int old_ret,
> > + unsigned long peer_mem_flags);
> > +void ib_peer_umem_release(struct ib_umem *umem);
> > +
> > +#endif
> > diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
> > new file mode 100644
> > index 000000000000..833865578cb0
> > --- /dev/null
> > +++ b/drivers/infiniband/core/peer_mem.c
> > @@ -0,0 +1,484 @@
> > +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> > +/*
> > + * Copyright (c) 2014-2020, Mellanox Technologies. All rights reserved.
> > + */
> > +#include <rdma/ib_verbs.h>
> > +#include <rdma/ib_umem.h>
> > +#include <linux/sched/mm.h>
> > +#include "ib_peer_mem.h"
> > +static DEFINE_MUTEX(peer_memory_mutex);
> > +static LIST_HEAD(peer_memory_list);
> > +static struct kobject *peers_kobj;
> > +#define PEER_NO_INVALIDATION_ID U32_MAX
> > +static int ib_invalidate_peer_memory(void *reg_handle, u64 core_context);
> > +struct peer_mem_attribute {
> > + struct attribute attr;
> > + ssize_t (*show)(struct ib_peer_memory_client *ib_peer_client,
> > + struct peer_mem_attribute *attr, char *buf);
> > + ssize_t (*store)(struct ib_peer_memory_client *ib_peer_client,
> > + struct peer_mem_attribute *attr, const char *buf,
> > + size_t count);
> > +};
> > +
> > +#define PEER_ATTR_RO(_name) \
> > + struct peer_mem_attribute peer_attr_ ## _name = __ATTR_RO(_name)
> > +
> > +static ssize_t version_show(struct ib_peer_memory_client *ib_peer_client,
> > + struct peer_mem_attribute *attr, char *buf)
> > +{
> > + return scnprintf(buf, PAGE_SIZE, "%s\n",
> > + ib_peer_client->peer_mem->version);
> > +}
> > +
> > +static PEER_ATTR_RO(version);
> > +static ssize_t num_alloc_mrs_show(struct ib_peer_memory_client *ib_peer_client,
> > + struct peer_mem_attribute *attr, char *buf)
> > +{
> > + return scnprintf(
> > + buf, PAGE_SIZE, "%llu\n",
> > + (u64)atomic64_read(&ib_peer_client->stats.num_alloc_mrs));
> > +}
> > +
> > +static PEER_ATTR_RO(num_alloc_mrs);
> > +static ssize_t
> > +num_dealloc_mrs_show(struct ib_peer_memory_client *ib_peer_client,
> > + struct peer_mem_attribute *attr, char *buf)
> > +{
> > + return scnprintf(
> > + buf, PAGE_SIZE, "%llu\n",
> > + (u64)atomic64_read(&ib_peer_client->stats.num_dealloc_mrs));
> > +}
> > +
> > +static PEER_ATTR_RO(num_dealloc_mrs);
> > +static ssize_t num_reg_pages_show(struct ib_peer_memory_client *ib_peer_client,
> > + struct peer_mem_attribute *attr, char *buf)
> > +{
> > + return scnprintf(
> > + buf, PAGE_SIZE, "%llu\n",
> > + (u64)atomic64_read(&ib_peer_client->stats.num_reg_pages));
> > +}
> > +
> > +static PEER_ATTR_RO(num_reg_pages);
> > +static ssize_t
> > +num_dereg_pages_show(struct ib_peer_memory_client *ib_peer_client,
> > + struct peer_mem_attribute *attr, char *buf)
> > +{
> > + return scnprintf(
> > + buf, PAGE_SIZE, "%llu\n",
> > + (u64)atomic64_read(&ib_peer_client->stats.num_dereg_pages));
> > +}
> > +
> > +static PEER_ATTR_RO(num_dereg_pages);
> > +static ssize_t num_reg_bytes_show(struct ib_peer_memory_client *ib_peer_client,
> > + struct peer_mem_attribute *attr, char *buf)
> > +{
> > + return scnprintf(
> > + buf, PAGE_SIZE, "%llu\n",
> > + (u64)atomic64_read(&ib_peer_client->stats.num_reg_bytes));
> > +}
> > +
> > +static PEER_ATTR_RO(num_reg_bytes);
> > +static ssize_t
> > +num_dereg_bytes_show(struct ib_peer_memory_client *ib_peer_client,
> > + struct peer_mem_attribute *attr, char *buf)
> > +{
> > + return scnprintf(
> > + buf, PAGE_SIZE, "%llu\n",
> > + (u64)atomic64_read(&ib_peer_client->stats.num_dereg_bytes));
> > +}
> > +
> > +static PEER_ATTR_RO(num_dereg_bytes);
> > +static ssize_t
> > +num_free_callbacks_show(struct ib_peer_memory_client *ib_peer_client,
> > + struct peer_mem_attribute *attr, char *buf)
> > +{
> > + return scnprintf(buf, PAGE_SIZE, "%lu\n",
> > + ib_peer_client->stats.num_free_callbacks);
> > +}
> > +
> > +static PEER_ATTR_RO(num_free_callbacks);
> > +static struct attribute *peer_mem_attrs[] = {
> > + &peer_attr_version.attr,
> > + &peer_attr_num_alloc_mrs.attr,
> > + &peer_attr_num_dealloc_mrs.attr,
> > + &peer_attr_num_reg_pages.attr,
> > + &peer_attr_num_dereg_pages.attr,
> > + &peer_attr_num_reg_bytes.attr,
> > + &peer_attr_num_dereg_bytes.attr,
> > + &peer_attr_num_free_callbacks.attr,
> > + NULL,
> > +};
> > +
> > +static const struct attribute_group peer_mem_attr_group = {
> > + .attrs = peer_mem_attrs,
> > +};
> > +
> > +static ssize_t peer_attr_show(struct kobject *kobj, struct attribute *attr,
> > + char *buf)
> > +{
> > + struct peer_mem_attribute *peer_attr =
> > + container_of(attr, struct peer_mem_attribute, attr);
> > + if (!peer_attr->show)
> > + return -EIO;
> > + return peer_attr->show(container_of(kobj, struct ib_peer_memory_client,
> > + kobj),
> > + peer_attr, buf);
> > +}
> > +
> > +static const struct sysfs_ops peer_mem_sysfs_ops = {
> > + .show = peer_attr_show,
> > +};
> > +
> > +static void ib_peer_memory_client_release(struct kobject *kobj)
> > +{
> > + struct ib_peer_memory_client *ib_peer_client =
> > + container_of(kobj, struct ib_peer_memory_client, kobj);
> > + kfree(ib_peer_client);
> > +}
> > +
> > +static struct kobj_type peer_mem_type = {
> > + .sysfs_ops = &peer_mem_sysfs_ops,
> > + .release = ib_peer_memory_client_release,
> > +};
> > +
> > +static int ib_memory_peer_check_mandatory(const struct peer_memory_client
> > + *peer_client)
> > +{
> > +#define PEER_MEM_MANDATORY_FUNC(x) {offsetof(struct peer_memory_client, x), #x}
> > + int i;
> > + static const struct {
> > + size_t offset;
> > + char *name;
> > + } mandatory_table[] = {
> > + PEER_MEM_MANDATORY_FUNC(acquire),
> > + PEER_MEM_MANDATORY_FUNC(get_pages),
> > + PEER_MEM_MANDATORY_FUNC(put_pages),
> > + PEER_MEM_MANDATORY_FUNC(dma_map),
> > + PEER_MEM_MANDATORY_FUNC(dma_unmap),
> > + };
> > + for (i = 0; i < ARRAY_SIZE(mandatory_table); ++i) {
> > + if (!*(void **)((void *)peer_client +
> > + mandatory_table[i].offset)) {
> > + pr_err("Peer memory %s is missing mandatory function %s\n",
> > + peer_client->name, mandatory_table[i].name);
> > + return -EINVAL;
> > + }
> > + }
> > + return 0;
> > +}
> > +
> > +void *
> > +ib_register_peer_memory_client(const struct peer_memory_client *peer_client,
> > + invalidate_peer_memory *invalidate_callback)
> > +{
> > + struct ib_peer_memory_client *ib_peer_client;
> > + int ret;
> > + if (ib_memory_peer_check_mandatory(peer_client))
> > + return NULL;
> > + ib_peer_client = kzalloc(sizeof(*ib_peer_client), GFP_KERNEL);
> > + if (!ib_peer_client)
> > + return NULL;
> > + kobject_init(&ib_peer_client->kobj, &peer_mem_type);
> > + refcount_set(&ib_peer_client->usecnt, 1);
> > + init_completion(&ib_peer_client->usecnt_zero);
> > + ib_peer_client->peer_mem = peer_client;
> > + xa_init_flags(&ib_peer_client->umem_xa, XA_FLAGS_ALLOC);
> > + /*
> > + * If the peer wants the invalidation_callback then all memory users
> > + * linked to that peer must support invalidation.
> > + */
> > + if (invalidate_callback) {
> > + *invalidate_callback = ib_invalidate_peer_memory;
> > + ib_peer_client->invalidation_required = true;
> > + }
> > + mutex_lock(&peer_memory_mutex);
> > + if (!peers_kobj) {
> > + /* Created under /sys/kernel/mm */
> > + peers_kobj = kobject_create_and_add("memory_peers", mm_kobj);
> > + if (!peers_kobj)
> > + goto err_unlock;
> > + }
> > + ret = kobject_add(&ib_peer_client->kobj, peers_kobj, peer_client->name);
> > + if (ret)
> > + goto err_parent;
> > + ret = sysfs_create_group(&ib_peer_client->kobj,
> > + &peer_mem_attr_group);
> > + if (ret)
> > + goto err_parent;
> > + list_add_tail(&ib_peer_client->core_peer_list, &peer_memory_list);
> > + mutex_unlock(&peer_memory_mutex);
> > + return ib_peer_client;
> > +err_parent:
> > + if (list_empty(&peer_memory_list)) {
> > + kobject_put(peers_kobj);
> > + peers_kobj = NULL;
> > + }
> > +err_unlock:
> > + mutex_unlock(&peer_memory_mutex);
> > + kobject_put(&ib_peer_client->kobj);
> > + return NULL;
> > +}
> > +EXPORT_SYMBOL(ib_register_peer_memory_client);
> > +
> > +void ib_unregister_peer_memory_client(void *reg_handle)
> > +{
> > + struct ib_peer_memory_client *ib_peer_client = reg_handle;
> > + mutex_lock(&peer_memory_mutex);
> > + list_del(&ib_peer_client->core_peer_list);
> > + if (list_empty(&peer_memory_list)) {
> > + kobject_put(peers_kobj);
> > + peers_kobj = NULL;
> > + }
> > + mutex_unlock(&peer_memory_mutex);
> > + /*
> > + * Wait for all umems to be destroyed before returning. Once
> > + * ib_unregister_peer_memory_client() returns no umems will call any
> > + * peer_mem ops.
> > + */
> > + if (refcount_dec_and_test(&ib_peer_client->usecnt))
> > + complete(&ib_peer_client->usecnt_zero);
> > + wait_for_completion(&ib_peer_client->usecnt_zero);
> > + kobject_put(&ib_peer_client->kobj);
> > +}
> > +EXPORT_SYMBOL(ib_unregister_peer_memory_client);
> > +
> > +static struct ib_peer_memory_client *
> > +ib_get_peer_client(unsigned long addr, size_t size,
> > + unsigned long peer_mem_flags, void **peer_client_context)
> > +{
> > + struct ib_peer_memory_client *ib_peer_client;
> > + int ret = 0;
> > + mutex_lock(&peer_memory_mutex);
> > + list_for_each_entry(ib_peer_client, &peer_memory_list,
> > + core_peer_list) {
> > + if (ib_peer_client->invalidation_required &&
> > + (!(peer_mem_flags & IB_PEER_MEM_INVAL_SUPP)))
> > + continue;
> > + ret = ib_peer_client->peer_mem->acquire(addr, size, NULL, NULL,
> > + peer_client_context);
> > + if (ret > 0) {
> > + refcount_inc(&ib_peer_client->usecnt);
> > + mutex_unlock(&peer_memory_mutex);
> > + return ib_peer_client;
> > + }
> > + }
> > + mutex_unlock(&peer_memory_mutex);
> > + return NULL;
> > +}
> > +
> > +static void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
> > + void *peer_client_context)
> > +{
> > + if (ib_peer_client->peer_mem->release)
> > + ib_peer_client->peer_mem->release(peer_client_context);
> > + if (refcount_dec_and_test(&ib_peer_client->usecnt))
> > + complete(&ib_peer_client->usecnt_zero);
> > +}
> > +
> > +static void ib_peer_umem_kref_release(struct kref *kref)
> > +{
> > + kfree(container_of(kref, struct ib_umem_peer, kref));
> > +}
> > +
> > +static void ib_unmap_peer_client(struct ib_umem_peer *umem_p)
> > +{
> > + struct ib_peer_memory_client *ib_peer_client = umem_p->ib_peer_client;
> > + const struct peer_memory_client *peer_mem = ib_peer_client->peer_mem;
> > + struct ib_umem *umem = &umem_p->umem;
> > +
> > + lockdep_assert_held(&umem_p->mapping_lock);
> > +
> > + peer_mem->dma_unmap(&umem_p->umem.sg_head, umem_p->peer_client_context,
> > + umem_p->umem.ibdev->dma_device);
> > + peer_mem->put_pages(&umem_p->umem.sg_head, umem_p->peer_client_context);
> > + memset(&umem->sg_head, 0, sizeof(umem->sg_head));
> > +
> > + atomic64_add(umem->nmap, &ib_peer_client->stats.num_dereg_pages);
> > + atomic64_add(umem->length, &ib_peer_client->stats.num_dereg_bytes);
> > + atomic64_inc(&ib_peer_client->stats.num_dealloc_mrs);
> > +
> > + if (umem_p->xa_id != PEER_NO_INVALIDATION_ID)
> > + xa_store(&ib_peer_client->umem_xa, umem_p->xa_id, NULL,
> > + GFP_KERNEL);
> > + umem_p->mapped = false;
> > +}
> > +
> > +static int ib_invalidate_peer_memory(void *reg_handle, u64 core_context)
> > +{
> > + struct ib_peer_memory_client *ib_peer_client = reg_handle;
> > + struct ib_umem_peer *umem_p;
> > +
> > + /*
> > + * The client is not required to fence against invalidation during
> > + * put_pages() as that would deadlock when we call put_pages() here.
> > + * Thus the core_context cannot be a umem pointer as we have no control
> > + * over the lifetime. Since we won't change the kABI for this to add a
> > + * proper kref, an xarray is used.
> > + */
> > + xa_lock(&ib_peer_client->umem_xa);
> > + ib_peer_client->stats.num_free_callbacks += 1;
> > + umem_p = xa_load(&ib_peer_client->umem_xa, core_context);
> > + if (!umem_p)
> > + goto out_unlock;
> > + kref_get(&umem_p->kref);
> > + xa_unlock(&ib_peer_client->umem_xa);
> > + mutex_lock(&umem_p->mapping_lock);
> > + if (umem_p->mapped) {
> > + /*
> > + * At this point the invalidation_func must be !NULL as the get
> > + * flow does not unlock mapping_lock until it is set, and umems
> > + * that do not require invalidation are not in the xarray.
> > + */
> > + umem_p->invalidation_func(&umem_p->umem,
> > + umem_p->invalidation_private);
> > + ib_unmap_peer_client(umem_p);
> > + }
> > + mutex_unlock(&umem_p->mapping_lock);
> > + kref_put(&umem_p->kref, ib_peer_umem_kref_release);
> > + return 0;
> > +out_unlock:
> > + xa_unlock(&ib_peer_client->umem_xa);
> > + return 0;
> > +}
> > +
> > +void ib_umem_activate_invalidation_notifier(struct ib_umem *umem,
> > + umem_invalidate_func_t func,
> > + void *priv)
> > +{
> > + struct ib_umem_peer *umem_p =
> > + container_of(umem, struct ib_umem_peer, umem);
> > +
> > + if (WARN_ON(!umem->is_peer))
> > + return;
> > + if (umem_p->xa_id == PEER_NO_INVALIDATION_ID)
> > + return;
> > +
> > + umem_p->invalidation_func = func;
> > + umem_p->invalidation_private = priv;
> > + /* Pairs with the lock in ib_peer_umem_get() */
> > + mutex_unlock(&umem_p->mapping_lock);
> > +
> > + /* At this point func can be called asynchronously */
> > +}
> > +EXPORT_SYMBOL(ib_umem_activate_invalidation_notifier);
> > +
> > +struct ib_umem *ib_peer_umem_get(struct ib_umem *old_umem, int old_ret,
> > + unsigned long peer_mem_flags)
> > +{
> > + struct ib_peer_memory_client *ib_peer_client;
> > + void *peer_client_context;
> > + struct ib_umem_peer *umem_p;
> > + int ret;
> > + ib_peer_client =
> > + ib_get_peer_client(old_umem->address, old_umem->length,
> > + peer_mem_flags, &peer_client_context);
> > + if (!ib_peer_client)
> > + return ERR_PTR(old_ret);
> > + umem_p = kzalloc(sizeof(*umem_p), GFP_KERNEL);
> > + if (!umem_p) {
> > + ret = -ENOMEM;
> > + goto err_client;
> > + }
> > +
> > + kref_init(&umem_p->kref);
> > + umem_p->umem = *old_umem;
> > + memset(&umem_p->umem.sg_head, 0, sizeof(umem_p->umem.sg_head));
> > + umem_p->umem.is_peer = 1;
> > + umem_p->ib_peer_client = ib_peer_client;
> > + umem_p->peer_client_context = peer_client_context;
> > + mutex_init(&umem_p->mapping_lock);
> > + umem_p->xa_id = PEER_NO_INVALIDATION_ID;
> > +
> > + mutex_lock(&umem_p->mapping_lock);
> > + if (ib_peer_client->invalidation_required) {
> > + ret = xa_alloc_cyclic(&ib_peer_client->umem_xa, &umem_p->xa_id,
> > + umem_p,
> > + XA_LIMIT(0, PEER_NO_INVALIDATION_ID - 1),
> > + &ib_peer_client->xa_cyclic_next,
> > + GFP_KERNEL);
> > + if (ret < 0)
> > + goto err_umem;
> > + }
> > +
> > + /*
> > + * We always request write permissions to the pages, to force breaking
> > + * of any CoW during the registration of the MR. For read-only MRs we
> > + * use the "force" flag to indicate that CoW breaking is required but
> > + * the registration should not fail if referencing read-only areas.
> > + */
> > + ret = ib_peer_client->peer_mem->get_pages(umem_p->umem.address,
> > + umem_p->umem.length, 1,
> > + !umem_p->umem.writable, NULL,
> > + peer_client_context,
> > + umem_p->xa_id);
> > + if (ret)
> > + goto err_xa;
> > +
> > + umem_p->umem.page_shift =
> > + ilog2(ib_peer_client->peer_mem->get_page_size(peer_client_context));
> > +
> > + ret = ib_peer_client->peer_mem->dma_map(&umem_p->umem.sg_head,
> > + peer_client_context,
> > + umem_p->umem.ibdev->dma_device,
> > + 0, &umem_p->umem.nmap);
> > + if (ret)
> > + goto err_pages;
> > +
> > + umem_p->mapped = true;
> > + atomic64_add(umem_p->umem.nmap, &ib_peer_client->stats.num_reg_pages);
> > + atomic64_add(umem_p->umem.length, &ib_peer_client->stats.num_reg_bytes);
> > + atomic64_inc(&ib_peer_client->stats.num_alloc_mrs);
> > +
> > + /*
> > + * If invalidation is allowed then the caller must call
> > + * ib_umem_activate_invalidation_notifier() or ib_peer_umem_release() to
> > + * unlock this mutex. The unlock should be done after the last
> > + * read of sg_head, once the caller is ready for the invalidation
> > + * function to be called.
> > + */
> > + if (umem_p->xa_id == PEER_NO_INVALIDATION_ID)
> > + mutex_unlock(&umem_p->mapping_lock);
> > + /*
> > + * On success the old umem is replaced with the new, larger allocation.
> > + */
> > + kfree(old_umem);
> > + return &umem_p->umem;
> > +err_pages:
> > + ib_peer_client->peer_mem->put_pages(&umem_p->umem.sg_head,
> > + umem_p->peer_client_context);
> > +err_xa:
> > + if (umem_p->xa_id != PEER_NO_INVALIDATION_ID)
> > + xa_erase(&umem_p->ib_peer_client->umem_xa, umem_p->xa_id);
> > +err_umem:
> > + mutex_unlock(&umem_p->mapping_lock);
> > + kref_put(&umem_p->kref, ib_peer_umem_kref_release);
> > +err_client:
> > + ib_put_peer_client(ib_peer_client, peer_client_context);
> > + return ERR_PTR(ret);
> > +}
> > +
> > +void ib_peer_umem_release(struct ib_umem *umem)
> > +{
> > + struct ib_umem_peer *umem_p =
> > + container_of(umem, struct ib_umem_peer, umem);
> > +
> > + /* invalidation_func being set indicates activate was called */
> > + if (umem_p->xa_id == PEER_NO_INVALIDATION_ID ||
> > + umem_p->invalidation_func)
> > + mutex_lock(&umem_p->mapping_lock);
> > +
> > + if (umem_p->mapped)
> > + ib_unmap_peer_client(umem_p);
> > + mutex_unlock(&umem_p->mapping_lock);
> > +
> > + if (umem_p->xa_id != PEER_NO_INVALIDATION_ID)
> > + xa_erase(&umem_p->ib_peer_client->umem_xa, umem_p->xa_id);
> > + ib_put_peer_client(umem_p->ib_peer_client, umem_p->peer_client_context);
> > + umem_p->ib_peer_client = NULL;
> > +
> > + /* Must match ib_umem_release() */
> > + atomic64_sub(ib_umem_num_pages(umem), &umem->owning_mm->pinned_vm);
> > + mmdrop(umem->owning_mm);
> > +
> > + kref_put(&umem_p->kref, ib_peer_umem_kref_release);
> > +}
> > diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> > index 698c5359f643..e7473285e470 100644
> > --- a/drivers/infiniband/core/umem.c
> > +++ b/drivers/infiniband/core/umem.c
> > @@ -42,6 +42,7 @@
> > #include <rdma/ib_umem_odp.h>
> > #include "uverbs.h"
> > +#include "ib_peer_mem.h"
> > static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
> > {
> > @@ -193,15 +194,17 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
> > EXPORT_SYMBOL(ib_umem_find_best_pgsz);
> > /**
> > - * ib_umem_get - Pin and DMA map userspace memory.
> > + * __ib_umem_get - Pin and DMA map userspace memory.
> > *
> > * @device: IB device to connect UMEM
> > * @addr: userspace virtual address to start at
> > * @size: length of region to pin
> > * @access: IB_ACCESS_xxx flags for memory being pinned
> > + * @peer_mem_flags: IB_PEER_MEM_xxx flags for memory being used
> > */
> > -struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
> > - size_t size, int access)
> > +struct ib_umem *__ib_umem_get(struct ib_device *device,
> > + unsigned long addr, size_t size, int access,
> > + unsigned long peer_mem_flags)
> > {
> > struct ib_umem *umem;
> > struct page **page_list;
> > @@ -309,6 +312,24 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
> > umem_release:
> > __ib_umem_release(device, umem, 0);
> > + /*
> > + * If the address belongs to a peer memory client, then the first
> > + * call to get_user_pages will fail. In this case, try to get
> > + * these pages from the peers.
> > + */
> > + //FIXME: this placement is horrible
> > + if (ret < 0 && peer_mem_flags & IB_PEER_MEM_ALLOW) {
> > + struct ib_umem *new_umem;
> > +
> > + new_umem = ib_peer_umem_get(umem, ret, peer_mem_flags);
> > + if (IS_ERR(new_umem)) {
> > + ret = PTR_ERR(new_umem);
> > + goto vma;
> > + }
> > + umem = new_umem;
> > + ret = 0;
> > + goto out;
> > + }
> > vma:
> > atomic64_sub(ib_umem_num_pages(umem), &mm->pinned_vm);
> > out:
> > @@ -320,8 +341,23 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
> > }
> > return ret ? ERR_PTR(ret) : umem;
> > }
> > +
> > +struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
> > + size_t size, int access)
> > +{
> > + return __ib_umem_get(device, addr, size, access, 0);
> > +}
> > EXPORT_SYMBOL(ib_umem_get);
> > +struct ib_umem *ib_umem_get_peer(struct ib_device *device, unsigned long addr,
> > + size_t size, int access,
> > + unsigned long peer_mem_flags)
> > +{
> > + return __ib_umem_get(device, addr, size, access,
> > + IB_PEER_MEM_ALLOW | peer_mem_flags);
> > +}
> > +EXPORT_SYMBOL(ib_umem_get_peer);
> > +
> > /**
> > * ib_umem_release - release memory pinned with ib_umem_get
> > * @umem: umem struct to release
> > @@ -333,6 +369,8 @@ void ib_umem_release(struct ib_umem *umem)
> > if (umem->is_odp)
> > return ib_umem_odp_release(to_ib_umem_odp(umem));
> > + if (umem->is_peer)
> > + return ib_peer_umem_release(umem);
> > __ib_umem_release(umem->ibdev, umem, 1);
> > atomic64_sub(ib_umem_num_pages(umem), &umem->owning_mm->pinned_vm);
> > diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
> > index 2f5ee37c252b..cd2241bb865a 100644
> > --- a/drivers/infiniband/hw/mlx5/cq.c
> > +++ b/drivers/infiniband/hw/mlx5/cq.c
> > @@ -733,8 +733,9 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct ib_udata *udata,
> > *cqe_size = ucmd.cqe_size;
> > cq->buf.umem =
> > - ib_umem_get(&dev->ib_dev, ucmd.buf_addr,
> > - entries * ucmd.cqe_size, IB_ACCESS_LOCAL_WRITE);
> > + ib_umem_get_peer(&dev->ib_dev, ucmd.buf_addr,
> > + entries * ucmd.cqe_size,
> > + IB_ACCESS_LOCAL_WRITE, 0);
> > if (IS_ERR(cq->buf.umem)) {
> > err = PTR_ERR(cq->buf.umem);
> > return err;
> > @@ -1132,9 +1133,9 @@ static int resize_user(struct mlx5_ib_dev *dev, struct mlx5_ib_cq *cq,
> > if (ucmd.cqe_size && SIZE_MAX / ucmd.cqe_size <= entries - 1)
> > return -EINVAL;
> > - umem = ib_umem_get(&dev->ib_dev, ucmd.buf_addr,
> > - (size_t)ucmd.cqe_size * entries,
> > - IB_ACCESS_LOCAL_WRITE);
> > + umem = ib_umem_get_peer(&dev->ib_dev, ucmd.buf_addr,
> > + (size_t)ucmd.cqe_size * entries,
> > + IB_ACCESS_LOCAL_WRITE, 0);
> > if (IS_ERR(umem)) {
> > err = PTR_ERR(umem);
> > return err;
> > diff --git a/drivers/infiniband/hw/mlx5/devx.c b/drivers/infiniband/hw/mlx5/devx.c
> > index c3b4b6586d17..f8f8507c7938 100644
> > --- a/drivers/infiniband/hw/mlx5/devx.c
> > +++ b/drivers/infiniband/hw/mlx5/devx.c
> > @@ -2143,7 +2143,7 @@ static int devx_umem_get(struct mlx5_ib_dev *dev, struct ib_ucontext *ucontext,
> > if (err)
> > return err;
> > - obj->umem = ib_umem_get(&dev->ib_dev, addr, size, access);
> > + obj->umem = ib_umem_get_peer(&dev->ib_dev, addr, size, access, 0);
> > if (IS_ERR(obj->umem))
> > return PTR_ERR(obj->umem);
> > diff --git a/drivers/infiniband/hw/mlx5/doorbell.c b/drivers/infiniband/hw/mlx5/doorbell.c
> > index 61475b571531..a2a7e121ee5f 100644
> > --- a/drivers/infiniband/hw/mlx5/doorbell.c
> > +++ b/drivers/infiniband/hw/mlx5/doorbell.c
> > @@ -64,8 +64,8 @@ int mlx5_ib_db_map_user(struct mlx5_ib_ucontext *context,
> > page->user_virt = (virt & PAGE_MASK);
> > page->refcnt = 0;
> > - page->umem = ib_umem_get(context->ibucontext.device, virt & PAGE_MASK,
> > - PAGE_SIZE, 0);
> > + page->umem = ib_umem_get_peer(context->ibucontext.device, virt & PAGE_MASK,
> > + PAGE_SIZE, 0, 0);
> > if (IS_ERR(page->umem)) {
> > err = PTR_ERR(page->umem);
> > kfree(page);
> > diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
> > index b5aece786b36..174567af5ddd 100644
> > --- a/drivers/infiniband/hw/mlx5/mem.c
> > +++ b/drivers/infiniband/hw/mlx5/mem.c
> > @@ -55,16 +55,17 @@ void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr,
> > int i = 0;
> > struct scatterlist *sg;
> > int entry;
> > + int page_shift = umem->is_peer ? umem->page_shift : PAGE_SHIFT;
> > - addr = addr >> PAGE_SHIFT;
> > + addr = addr >> page_shift;
> > tmp = (unsigned long)addr;
> > m = find_first_bit(&tmp, BITS_PER_LONG);
> > if (max_page_shift)
> > - m = min_t(unsigned long, max_page_shift - PAGE_SHIFT, m);
> > + m = min_t(unsigned long, max_page_shift - page_shift, m);
> > for_each_sg(umem->sg_head.sgl, sg, umem->nmap, entry) {
> > - len = sg_dma_len(sg) >> PAGE_SHIFT;
> > - pfn = sg_dma_address(sg) >> PAGE_SHIFT;
> > + len = sg_dma_len(sg) >> page_shift;
> > + pfn = sg_dma_address(sg) >> page_shift;
> > if (base + p != pfn) {
> > /* If either the offset or the new
> > * base are unaligned update m
> > @@ -96,7 +97,7 @@ void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr,
> > *ncont = 0;
> > }
> > - *shift = PAGE_SHIFT + m;
> > + *shift = page_shift + m;
> > *count = i;
> > }
> > diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
> > index 24daf420317e..2d075ca40bfc 100644
> > --- a/drivers/infiniband/hw/mlx5/mr.c
> > +++ b/drivers/infiniband/hw/mlx5/mr.c
> > @@ -41,6 +41,8 @@
> > #include <rdma/ib_verbs.h>
> > #include "mlx5_ib.h"
> > +static void mlx5_invalidate_umem(struct ib_umem *umem, void *priv);
> > +
> > enum {
> > MAX_PENDING_REG_MR = 8,
> > };
> > @@ -754,7 +756,7 @@ static int mr_cache_max_order(struct mlx5_ib_dev *dev)
> > static int mr_umem_get(struct mlx5_ib_dev *dev, u64 start, u64 length,
> > int access_flags, struct ib_umem **umem, int *npages,
> > - int *page_shift, int *ncont, int *order)
> > + int *page_shift, int *ncont, int *order, bool allow_peer)
> > {
> > struct ib_umem *u;
> > @@ -779,7 +781,13 @@ static int mr_umem_get(struct mlx5_ib_dev *dev, u64 start, u64 length,
> > if (order)
> > *order = ilog2(roundup_pow_of_two(*ncont));
> > } else {
> > - u = ib_umem_get(&dev->ib_dev, start, length, access_flags);
> > + if (allow_peer)
> > + u = ib_umem_get_peer(&dev->ib_dev, start, length,
> > + access_flags,
> > + IB_PEER_MEM_INVAL_SUPP);
> > + else
> > + u = ib_umem_get(&dev->ib_dev, start, length,
> > + access_flags);
> > if (IS_ERR(u)) {
> > mlx5_ib_dbg(dev, "umem get failed (%ld)\n", PTR_ERR(u));
> > return PTR_ERR(u);
> > @@ -1280,7 +1288,7 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
> > }
> > err = mr_umem_get(dev, start, length, access_flags, &umem,
> > - &npages, &page_shift, &ncont, &order);
> > + &npages, &page_shift, &ncont, &order, true);
> > if (err < 0)
> > return ERR_PTR(err);
> > @@ -1335,6 +1343,12 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
> > }
> > }
> > + if (umem->is_peer) {
> > + ib_umem_activate_invalidation_notifier(
> > + umem, mlx5_invalidate_umem, mr);
> > + /* After this point the MR can be invalidated */
> > + }
> > +
> > if (is_odp_mr(mr)) {
> > to_ib_umem_odp(mr->umem)->private = mr;
> > atomic_set(&mr->num_pending_prefetch, 0);
> > @@ -1412,6 +1426,10 @@ int mlx5_ib_rereg_user_mr(struct ib_mr *ib_mr, int flags, u64 start,
> > atomic_sub(mr->npages, &dev->mdev->priv.reg_pages);
> > + /* Rereg of peer memory umems isn't supported */
> > + if (mr->umem->is_peer)
> > + return -EOPNOTSUPP;
> > +
> > if (!mr->umem)
> > return -EINVAL;
> > @@ -1435,7 +1453,7 @@ int mlx5_ib_rereg_user_mr(struct ib_mr *ib_mr, int flags, u64 start,
> > ib_umem_release(mr->umem);
> > mr->umem = NULL;
> > err = mr_umem_get(dev, addr, len, access_flags, &mr->umem,
> > - &npages, &page_shift, &ncont, &order);
> > + &npages, &page_shift, &ncont, &order, false);
> > if (err)
> > goto err;
> > }
> > @@ -1615,13 +1633,14 @@ static void dereg_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr)
> > * We should unregister the DMA address from the HCA before
> > * remove the DMA mapping.
> > */
> > - mlx5_mr_cache_free(dev, mr);
> > + if (mr->allocated_from_cache)
> > + mlx5_mr_cache_free(dev, mr);
> > + else
> > + kfree(mr);
> > +
> > ib_umem_release(umem);
> > if (umem)
> > atomic_sub(npages, &dev->mdev->priv.reg_pages);
> > -
> > - if (!mr->allocated_from_cache)
> > - kfree(mr);
> > }
> > int mlx5_ib_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
> > @@ -2331,3 +2350,15 @@ int mlx5_ib_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
> > return n;
> > }
> > +
> > +static void mlx5_invalidate_umem(struct ib_umem *umem, void *priv)
> > +{
> > + struct mlx5_ib_mr *mr = priv;
> > +
> > + /*
> > + * DMA is turned off for the mkey, but the mkey remains otherwise
> > + * untouched until the normal flow of dereg_mr happens. Any access to
> > + * this mkey will generate CQEs.
> > + */
> > + unreg_umr(mr->dev, mr);
> > +}
> > diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
> > index 45faab9e1313..be59c6d5ba1c 100644
> > --- a/drivers/infiniband/hw/mlx5/qp.c
> > +++ b/drivers/infiniband/hw/mlx5/qp.c
> > @@ -749,7 +749,7 @@ static int mlx5_ib_umem_get(struct mlx5_ib_dev *dev, struct ib_udata *udata,
> > {
> > int err;
> > - *umem = ib_umem_get(&dev->ib_dev, addr, size, 0);
> > + *umem = ib_umem_get_peer(&dev->ib_dev, addr, size, 0, 0);
> > if (IS_ERR(*umem)) {
> > mlx5_ib_dbg(dev, "umem_get failed\n");
> > return PTR_ERR(*umem);
> > diff --git a/drivers/infiniband/hw/mlx5/srq.c b/drivers/infiniband/hw/mlx5/srq.c
> > index 6d1ff13d2283..2f55f7e1923d 100644
> > --- a/drivers/infiniband/hw/mlx5/srq.c
> > +++ b/drivers/infiniband/hw/mlx5/srq.c
> > @@ -80,7 +80,7 @@ static int create_srq_user(struct ib_pd *pd, struct mlx5_ib_srq *srq,
> > srq->wq_sig = !!(ucmd.flags & MLX5_SRQ_FLAG_SIGNATURE);
> > - srq->umem = ib_umem_get(pd->device, ucmd.buf_addr, buf_size, 0);
> > + srq->umem = ib_umem_get_peer(pd->device, ucmd.buf_addr, buf_size, 0, 0);
> > if (IS_ERR(srq->umem)) {
> > mlx5_ib_dbg(dev, "failed umem get, size %d\n", buf_size);
> > err = PTR_ERR(srq->umem);
> > diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> > index 9353910915d4..ec9824cbf49d 100644
> > --- a/include/rdma/ib_umem.h
> > +++ b/include/rdma/ib_umem.h
> > @@ -48,10 +48,19 @@ struct ib_umem {
> > unsigned long address;
> > u32 writable : 1;
> > u32 is_odp : 1;
> > + /* Placing at the end of the bitfield list is ABI preserving on LE */
> > + u32 is_peer : 1;
> > struct work_struct work;
> > struct sg_table sg_head;
> > int nmap;
> > unsigned int sg_nents;
> > + unsigned int page_shift;
> > +};
> > +
> > +typedef void (*umem_invalidate_func_t)(struct ib_umem *umem, void *priv);
> > +enum ib_peer_mem_flags {
> > + IB_PEER_MEM_ALLOW = 1 << 0,
> > + IB_PEER_MEM_INVAL_SUPP = 1 << 1,
> > };
> > /* Returns the offset of the umem start relative to the first page. */
> > @@ -79,6 +88,13 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
> > unsigned long pgsz_bitmap,
> > unsigned long virt);
> > +struct ib_umem *ib_umem_get_peer(struct ib_device *device, unsigned long addr,
> > + size_t size, int access,
> > + unsigned long peer_mem_flags);
> > +void ib_umem_activate_invalidation_notifier(struct ib_umem *umem,
> > + umem_invalidate_func_t func,
> > + void *cookie);
> > +
> > #else /* CONFIG_INFINIBAND_USER_MEM */
> > #include <linux/err.h>
> > @@ -102,6 +118,19 @@ static inline unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
> > return 0;
> > }
> > +static inline struct ib_umem *ib_umem_get_peer(struct ib_device *device,
> > + unsigned long addr, size_t size,
> > + int access,
> > + unsigned long peer_mem_flags)
> > +{
> > + return ERR_PTR(-EINVAL);
> > +}
> > +
> > +static inline void ib_umem_activate_invalidation_notifier(
> > + struct ib_umem *umem, umem_invalidate_func_t func, void *cookie)
> > +{
> > +}
> > +
> > #endif /* CONFIG_INFINIBAND_USER_MEM */
> > #endif /* IB_UMEM_H */
> > diff --git a/include/rdma/peer_mem.h b/include/rdma/peer_mem.h
> > new file mode 100644
> > index 000000000000..563a820dbc32
> > --- /dev/null
> > +++ b/include/rdma/peer_mem.h
> > @@ -0,0 +1,165 @@
> > +/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
> > +/*
> > + * Copyright (c) 2014-2020, Mellanox Technologies. All rights reserved.
> > + */
> > +#ifndef RDMA_PEER_MEM_H
> > +#define RDMA_PEER_MEM_H
> > +
> > +#include <linux/scatterlist.h>
> > +
> > +#define IB_PEER_MEMORY_NAME_MAX 64
> > +#define IB_PEER_MEMORY_VER_MAX 16
> > +
> > +/*
> > + * Prior versions used a void * for core_context; at some point this was
> > + * switched to use u64. Be careful if compiling this as 32 bit. To help, the
> > + * value of core_context is limited to u32, so it should work OK despite the
> > + * type change.
> > + */
> > +#define PEER_MEM_U64_CORE_CONTEXT
> > +
> > +struct device;
> > +
> > +/**
> > + * struct peer_memory_client - registration information for user virtual
> > + * memory handlers
> > + *
> > + * The peer_memory_client scheme allows a driver to register with the ib_umem
> > + * system that it has the ability to understand user virtual address ranges
> > + * that are not compatible with get_user_pages(). For instance, VMAs created
> > + * with io_remap_pfn_range(), or other driver-specific special VMAs.
> > + *
> > + * For ranges the interface understands it can provide a DMA mapped sg_table
> > + * for use by the ib_umem, allowing user virtual ranges that cannot be
> > + * supported by get_user_pages() to be used as umems.
> > + */
> > +struct peer_memory_client {
> > + char name[IB_PEER_MEMORY_NAME_MAX];
> > + char version[IB_PEER_MEMORY_VER_MAX];
> > +
> > + /**
> > + * acquire - Begin working with a user space virtual address range
> > + *
> > + * @addr - Virtual address to be checked for whether it belongs to the peer.
> > + * @size - Length of the virtual memory area starting at addr.
> > + * @peer_mem_private_data - Obsolete, always NULL
> > + * @peer_mem_name - Obsolete, always NULL
> > + * @client_context - Returns an opaque value for this acquire use in
> > + * other APIs
> > + *
> > + * Returns 1 if the peer_memory_client supports the entire virtual
> > + * address range, 0 or -ERRNO otherwise. If 1 is returned then
> > + * release() will be called to release the acquire().
> > + */
> > + int (*acquire)(unsigned long addr, size_t size,
> > + void *peer_mem_private_data, char *peer_mem_name,
> > + void **client_context);
> > + /**
> > + * get_pages - Fill in the first part of a sg_table for a virtual
> > + * address range
> > + *
> > + * @addr - Virtual address to be checked for whether it belongs to the peer.
> > + * @size - Length of the virtual memory area starting at addr.
> > + * @write - Always 1
> > + * @force - 1 if write is required
> > + * @sg_head - Obsolete, always NULL
> > + * @client_context - Value returned by acquire()
> > + * @core_context - Value to be passed to invalidate_peer_memory for
> > + * this get
> > + *
> > + * addr/size are passed as the raw virtual address range requested by
> > + * the user; it is not aligned to any page size. get_pages() is always
> > + * followed by dma_map().
> > + *
> > + * Upon return the caller can call the invalidate_callback().
> > + *
> > + * Returns 0 on success, -ERRNO on failure. After success put_pages()
> > + * will be called to return the pages.
> > + */
> > + int (*get_pages)(unsigned long addr, size_t size, int write, int force,
> > + struct sg_table *sg_head, void *client_context,
> > + u64 core_context);
> > + /**
> > + * dma_map - Create a DMA mapped sg_table
> > + *
> > + * @sg_head - The sg_table to allocate
> > + * @client_context - Value returned by acquire()
> > + * @dma_device - The device that will be doing DMA from these addresses
> > + * @dmasync - Obsolete, always 0
> > + * @nmap - Returns the number of dma mapped entries in the sg_head
> > + *
> > + * Must be called after get_pages(). This must fill in the sg_head with
> > + * DMA mapped SGLs for dma_device. Each SGL start and end must meet a
> > + * minimum alignment of at least PAGE_SIZE, though individual sgls can
> > + * be multiples of PAGE_SIZE, in any mixture. Since the user virtual
> > + * address/size are not page aligned, the implementation must round them
> > + * up to the logical alignment when building the SGLs.
> > + *
> > + * Returns 0 on success, -ERRNO on failure. After success dma_unmap()
> > + * will be called to unmap the pages. On failure sg_head must be left
> > + * untouched or point to a valid sg_table.
> > + */
> > + int (*dma_map)(struct sg_table *sg_head, void *client_context,
> > + struct device *dma_device, int dmasync, int *nmap);
> > + /**
> > + * dma_unmap - Unmap a DMA mapped sg_table
> > + *
> > + * @sg_head - The sg_table to unmap
> > + * @client_context - Value returned by acquire()
> > + * @dma_device - The device that will be doing DMA from these addresses
> > + *
> > + * sg_head will not be touched after this function returns.
> > + *
> > + * Must return 0.
> > + */
> > + int (*dma_unmap)(struct sg_table *sg_head, void *client_context,
> > + struct device *dma_device);
> > + /**
> > + * put_pages - Unpin a SGL
> > + *
> > + * @sg_head - The sg_table to unpin
> > + * @client_context - Value returned by acquire()
> > + *
> > + * sg_head must be freed on return.
> > + */
> > + void (*put_pages)(struct sg_table *sg_head, void *client_context);
> > + /* Used by the core to derive the umem page_shift */
> > + unsigned long (*get_page_size)(void *client_context);
> > + /**
> > + * release - Undo acquire
> > + *
> > + * @client_context - Value returned by acquire()
> > + *
> > + * If acquire() returns 1 then release() must be called. All
> > + * get_pages() and dma_map()'s must be undone before calling this
> > + * function.
> > + */
> > + void (*release)(void *client_context);
> > +};
> > +
> > +/*
> > + * If invalidate_callback() is non-NULL then the client will only support
> > + * umems which can be invalidated. The caller may call the
> > + * invalidate_callback() after acquire(); on return the range will no longer
> > + * have DMA active, and release() will have been called.
> > + *
> > + * Note: The implementation locking must ensure that get_pages(), and
> > + * dma_map() do not have locking dependencies with invalidate_callback(). The
> > + * ib_core will wait until any concurrent get_pages() or dma_map() completes
> > + * before returning.
> > + *
> > + * Similarly, the ib_core can call dma_unmap(), put_pages() and release() from within
> > + * the callback, or will wait for another thread doing those operations to
> > + * complete.
> > + *
> > + * For these reasons the user of invalidate_callback() must be careful with
> > + * locking.
> > + */
> > +typedef int (*invalidate_peer_memory)(void *reg_handle, u64 core_context);
> > +
> > +void *
> > +ib_register_peer_memory_client(const struct peer_memory_client *peer_client,
> > + invalidate_peer_memory *invalidate_callback);
> > +void ib_unregister_peer_memory_client(void *reg_handle);
> > +
> > +#endif
> >
>
>