ACK: [SRU Bionic 1/1] gup: document and work around "COW can break either way" issue

William Breathitt Gray william.gray at canonical.com
Fri Jan 22 00:51:14 UTC 2021


On Thu, Jan 21, 2021 at 05:40:10PM -0300, Thadeu Lima de Souza Cascardo wrote:
> On Wed, Jan 13, 2021 at 05:54:06AM -0500, William Breathitt Gray wrote:
> > On Wed, Dec 16, 2020 at 10:25:57PM -0300, Thadeu Lima de Souza Cascardo wrote:
> > > From: Linus Torvalds <torvalds at linux-foundation.org>
> > > 
> > > Doing a "get_user_pages()" on a copy-on-write page for reading can be
> > > ambiguous: the page can be COW'ed at any time afterwards, and the
> > > direction of a COW event isn't defined.
> > > 
> > > Yes, whoever writes to it will generally do the COW, but if the thread
> > > that did the get_user_pages() unmapped the page before the write (and
> > > that could happen due to memory pressure in addition to any outright
> > > action), the writer could also just take over the old page instead.
> > > 
> > > End result: the get_user_pages() call might result in a page pointer
> > > that is no longer associated with the original VM, and is associated
> > > with - and controlled by - another VM having taken it over instead.
> > > 
> > > So when doing a get_user_pages() on a COW mapping, the only really safe
> > > thing to do would be to break the COW when getting the page, even when
> > > only getting it for reading.
> > > 
> > > At the same time, some users simply don't even care.
> > > 
> > > For example, the perf code wants to look up the page not because it
> > > cares about the page, but because the code simply wants to look up the
> > > physical address of the access for informational purposes, and doesn't
> > > really care about races when a page might be unmapped and remapped
> > > elsewhere.
> > > 
> > > This adds logic to force a COW event by setting FOLL_WRITE on any
> > > copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
> > > pointer as a result.
> > > 
> > > The current semantics end up being:
> > > 
> > >  - __get_user_pages_fast(): no change. If you don't ask for a write,
> > >    you won't break COW. You'd better know what you're doing.
> > > 
> > >  - get_user_pages_fast(): the fast-case "look it up in the page tables
> > >    without anything getting mmap_sem" now refuses to follow a read-only
> > >    page, since it might need COW breaking.  Which happens in the slow
> > >    path - the fast path doesn't know if the memory might be COW or not.
> > > 
> > >  - get_user_pages() (including the slow-path fallback for gup_fast()):
> > >    for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
> > >    very similar semantics to FOLL_FORCE.
> > > 
> > > If it turns out that we want finer granularity (ie "only break COW when
> > > it might actually matter" - things like the zero page are special and
> > > don't need to be broken) we might need to push these semantics deeper
> > > into the lookup fault path.  So if people care enough, it's possible
> > > that we might end up adding a new internal FOLL_BREAK_COW flag to go
> > > with the internal FOLL_COW flag we already have for tracking "I had a
> > > COW".
> > > 
> > > Alternatively, if it turns out that different callers might want to
> > > explicitly control the forced COW break behavior, we might even want to
> > > make such a flag visible to the users of get_user_pages() instead of
> > > using the above default semantics.
> > > 
> > > But for now, this is mostly commentary on the issue (this commit message
> > > being a lot bigger than the patch, and that patch in turn is almost all
> > > comments), with that minimal "enable COW breaking early" logic using the
> > > existing FOLL_WRITE behavior.
> > > 
> > > [ It might be worth noting that we've always had this ambiguity, and it
> > >   could arguably be seen as a user-space issue.
> > > 
> > >   You only get private COW mappings that could break either way in
> > >   situations where user space is doing cooperative things (ie fork()
> > >   before an execve() etc), but it _is_ surprising and very subtle, and
> > >   fork() is supposed to give you independent address spaces.
> > > 
> > >   So let's treat this as a kernel issue and make the semantics of
> > >   get_user_pages() easier to understand. Note that obviously a true
> > >   shared mapping will still get a page that can change under us, so this
> > >   does _not_ mean that get_user_pages() somehow returns any "stable"
> > >   page ]
> > > 
> > > Reported-by: Jann Horn <jannh at google.com>
> > > Tested-by: Christoph Hellwig <hch at lst.de>
> > > Acked-by: Oleg Nesterov <oleg at redhat.com>
> > > Acked-by: Kirill Shutemov <kirill at shutemov.name>
> > > Acked-by: Jan Kara <jack at suse.cz>
> > > Cc: Andrea Arcangeli <aarcange at redhat.com>
> > > Cc: Matthew Wilcox <willy at infradead.org>
> > > Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
> > > (backported from commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f)
> > > [cascardo: fixed conflict due to missing commit ad415db817964e96df824e8bb1a861527f8012b6]
> > > [cascardo: fixed conflict due to missing commit b798bec4741bdd80224214fdd004c8e52698e425]
> > > [cascardo: fixed minor conflict on __get_user_pages_fast comment]
> > > [cascardo: fixup for absence of FOLL_PIN]
> > > [cascardo: use write=1 on s390's get_user_pages_fast too]
> > > CVE-2020-29374
> > > Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo at canonical.com>
> > > ---
> > >  arch/s390/mm/gup.c                      |  9 ++++-
> > >  drivers/gpu/drm/i915/i915_gem_userptr.c |  8 +++++
> > >  mm/gup.c                                | 44 +++++++++++++++++++++----
> > >  mm/huge_memory.c                        |  7 ++--
> > >  4 files changed, 57 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
> > > index 9bce54eac0b0..713cf80a740e 100644
> > > --- a/arch/s390/mm/gup.c
> > > +++ b/arch/s390/mm/gup.c
> > > @@ -283,9 +283,16 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
> > >  {
> > >  	int nr, ret;
> > >  
> > > +	/*
> > > +	 * The FAST_GUP case requires FOLL_WRITE even for pure reads,
> > > +	 * because get_user_pages() may need to cause an early COW in
> > > +	 * order to avoid confusing the normal COW routines. So only
> > > +	 * targets that are already writable are safe to do by just
> > > +	 * looking at the page tables.
> > > +	 */
> > >  	might_sleep();
> > >  	start &= PAGE_MASK;
> > > -	nr = __get_user_pages_fast(start, nr_pages, write, pages);
> > > +	nr = __get_user_pages_fast(start, nr_pages, 1, pages);
> > >  	if (nr == nr_pages)
> > >  		return nr;
> > >  
> > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > index f1bd66b5f006..d87beef8ceb3 100644
> > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > @@ -640,6 +640,14 @@ static int i915_gem_userptr_get_pages(struct drm_i915_gem_object *obj)
> > >  				      GFP_KERNEL |
> > >  				      __GFP_NORETRY |
> > >  				      __GFP_NOWARN);
> > > +		/*
> > > +		 * Using __get_user_pages_fast() with a read-only
> > > +		 * access is questionable. A read-only page may be
> > > +		 * COW-broken, and then this might end up giving
> > > +		 * the wrong side of the COW..
> > > +		 *
> > > +		 * We may or may not care.
> > > +		 */
> > >  		if (pvec) /* defer to worker if malloc fails */
> > >  			pinned = __get_user_pages_fast(obj->userptr.ptr,
> > >  						       num_pages,
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index 12b9626b1a9e..ebf4a3482dee 100644
> > > --- a/mm/gup.c
> > > +++ b/mm/gup.c
> > > @@ -61,13 +61,22 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
> > >  }
> > >  
> > >  /*
> > > - * FOLL_FORCE can write to even unwritable pte's, but only
> > > - * after we've gone through a COW cycle and they are dirty.
> > > + * FOLL_FORCE or a forced COW break can write even to unwritable pte's,
> > > + * but only after we've gone through a COW cycle and they are dirty.
> > >   */
> > >  static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
> > >  {
> > > -	return pte_write(pte) ||
> > > -		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
> > > +	return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
> > > +}
> > > +
> > > +/*
> > > + * A (separate) COW fault might break the page the other way and
> > > + * get_user_pages() would return the page from what is now the wrong
> > > + * VM. So we need to force a COW break at GUP time even for reads.
> > > + */
> > > +static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
> > > +{
> > > +	return is_cow_mapping(vma->vm_flags) && (flags & (FOLL_GET));
> > >  }
> > >  
> > >  static struct page *follow_page_pte(struct vm_area_struct *vma,
> > > @@ -694,12 +703,18 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> > >  			if (!vma || check_vma_flags(vma, gup_flags))
> > >  				return i ? : -EFAULT;
> > >  			if (is_vm_hugetlb_page(vma)) {
> > > +				if (should_force_cow_break(vma, foll_flags))
> > > +					foll_flags |= FOLL_WRITE;
> > >  				i = follow_hugetlb_page(mm, vma, pages, vmas,
> > >  						&start, &nr_pages, i,
> > > -						gup_flags, nonblocking);
> > > +						foll_flags, nonblocking);
> > >  				continue;
> > >  			}
> > >  		}
> > > +
> > > +		if (should_force_cow_break(vma, foll_flags))
> > > +			foll_flags |= FOLL_WRITE;
> > > +
> > >  retry:
> > >  		/*
> > >  		 * If we have a pending SIGKILL, don't keep faulting pages and
> > > @@ -1796,6 +1811,10 @@ bool gup_fast_permitted(unsigned long start, int nr_pages, int write)
> > >  /*
> > >   * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
> > >   * the regular GUP. It will only return non-negative values.
> > > + *
> > > + * Careful, careful! COW breaking can go either way, so a non-write
> > > + * access can get ambiguous page results. If you call this function without
> > > + * 'write' set, you'd better be sure that you're ok with that ambiguity.
> > >   */
> > >  int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
> > >  			  struct page **pages)
> > > @@ -1823,6 +1842,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
> > >  	 *
> > >  	 * We do not adopt an rcu_read_lock(.) here as we also want to
> > >  	 * block IPIs that come from THPs splitting.
> > > +	 *
> > > +	 * NOTE! We allow read-only gup_fast() here, but you'd better be
> > > +	 * careful about possible COW pages. You'll get _a_ COW page, but
> > > +	 * not necessarily the one you intended to get depending on what
> > > +	 * COW event happens after this. COW may break the page copy in a
> > > +	 * random direction.
> > >  	 */
> > >  
> > >  	if (gup_fast_permitted(start, nr_pages, write)) {
> > > @@ -1868,9 +1893,16 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
> > >  					(void __user *)start, len)))
> > >  		return -EFAULT;
> > >  
> > > +	/*
> > > +	 * The FAST_GUP case requires FOLL_WRITE even for pure reads,
> > > +	 * because get_user_pages() may need to cause an early COW in
> > > +	 * order to avoid confusing the normal COW routines. So only
> > > +	 * targets that are already writable are safe to do by just
> > > +	 * looking at the page tables.
> > > +	 */
> > >  	if (gup_fast_permitted(start, nr_pages, write)) {
> > >  		local_irq_disable();
> > > -		gup_pgd_range(addr, end, write, pages, &nr);
> > > +		gup_pgd_range(addr, end, 1, pages, &nr);
> > >  		local_irq_enable();
> > >  		ret = nr;
> > >  	}
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index b0a6ba0833e4..9f8fbb504d15 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1425,13 +1425,12 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
> > >  }
> > >  
> > >  /*
> > > - * FOLL_FORCE can write to even unwritable pmd's, but only
> > > - * after we've gone through a COW cycle and they are dirty.
> > > + * FOLL_FORCE or a forced COW break can write even to unwritable pmd's,
> > > + * but only after we've gone through a COW cycle and they are dirty.
> > >   */
> > >  static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
> > >  {
> > > -	return pmd_write(pmd) ||
> > > -	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
> > > +	return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd));
> > 
> > The comment above this function mentions FOLL_FORCE, but I see it
> > removed here. Is that removal intentional?
> 
> Yep, that comes from the upstream code.
> 
> Cascardo.

You're right, the upstream patch removes this.

Acked-by: William Breathitt Gray <william.gray at canonical.com>

> 
> > 
> > William Breathitt Gray
> > 
> > >  }
> > >  
> > >  struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
> > > -- 
> > > 2.27.0
> > > 
> > > 
> > > -- 
> > > kernel-team mailing list
> > > kernel-team at lists.ubuntu.com
> > > https://lists.ubuntu.com/mailman/listinfo/kernel-team
> 
> 