NACK: [PATCH][B][aws] UBUNTU SAUCE: mm: swap: improve swap readahead heuristic

Sultan Alsawaf sultan.alsawaf at canonical.com
Wed Dec 11 21:06:16 UTC 2019


On Wed, Dec 11, 2019 at 01:00:02PM -0800, Sultan Alsawaf wrote:
> On Tue, Dec 03, 2019 at 11:58:59AM +0100, Andrea Righi wrote:
> > BugLink: https://bugs.launchpad.net/bugs/1831940
> > 
> > Apply a more aggressive swapin readahead policy to improve swapoff
> > performance.
> > 
> > The idea is to start with no readahead (only read one page) and linearly
> > increment the amount of readahead pages each time swapin_readahead() is
> > called, up to the maximum cluster size (defined by vm.page-cluster),
> > then go back to one page to give the disk enough time to prefetch the
> > requested pages and avoid re-requesting them multiple times.
> > 
> > Also increase the default vm.page-cluster size to 8 (that seems to work
> > better with this new heuristic).
> > 
> > Signed-off-by: Andrea Righi <andrea.righi at canonical.com>
> > ---
> >  mm/swap.c       |  2 +-
> >  mm/swap_state.c | 60 ++++++++-----------------------------------------
> >  2 files changed, 10 insertions(+), 52 deletions(-)
> > 
> > diff --git a/mm/swap.c b/mm/swap.c
> > index abc82e6c14d1..5603bc987ef0 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -1022,7 +1022,7 @@ void __init swap_setup(void)
> >  	if (megs < 16)
> >  		page_cluster = 2;
> >  	else
> > -		page_cluster = 3;
> > +		page_cluster = 8;
> >  	/*
> >  	 * Right now other parts of the system means that we
> >  	 * _really_ don't want to cluster much more
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 6dac8c6ee6d9..a2246bcebc77 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -472,62 +472,21 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> >  	return retpage;
> >  }
> >  
> > -static unsigned int __swapin_nr_pages(unsigned long prev_offset,
> > -				      unsigned long offset,
> > -				      int hits,
> > -				      int max_pages,
> > -				      int prev_win)
> > -{
> > -	unsigned int pages, last_ra;
> > -
> > -	/*
> > -	 * This heuristic has been found to work well on both sequential and
> > -	 * random loads, swapping to hard disk or to SSD: please don't ask
> > -	 * what the "+ 2" means, it just happens to work well, that's all.
> > -	 */
> > -	pages = hits + 2;
> > -	if (pages == 2) {
> > -		/*
> > -		 * We can have no readahead hits to judge by: but must not get
> > -		 * stuck here forever, so check for an adjacent offset instead
> > -		 * (and don't even bother to check whether swap type is same).
> > -		 */
> > -		if (offset != prev_offset + 1 && offset != prev_offset - 1)
> > -			pages = 1;
> > -	} else {
> > -		unsigned int roundup = 4;
> > -		while (roundup < pages)
> > -			roundup <<= 1;
> > -		pages = roundup;
> > -	}
> > -
> > -	if (pages > max_pages)
> > -		pages = max_pages;
> > -
> > -	/* Don't shrink readahead too fast */
> > -	last_ra = prev_win / 2;
> > -	if (pages < last_ra)
> > -		pages = last_ra;
> > -
> > -	return pages;
> > -}
> > -
> >  static unsigned long swapin_nr_pages(unsigned long offset)
> >  {
> > -	static unsigned long prev_offset;
> > -	unsigned int hits, pages, max_pages;
> > -	static atomic_t last_readahead_pages;
> > +	static unsigned int prev_pages;
> > +	unsigned long pages, max_pages;
> >  
> >  	max_pages = 1 << READ_ONCE(page_cluster);
> >  	if (max_pages <= 1)
> >  		return 1;
> >  
> > -	hits = atomic_xchg(&swapin_readahead_hits, 0);
> > -	pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,
> > -				  atomic_read(&last_readahead_pages));
> > -	if (!hits)
> > -		prev_offset = offset;
> > -	atomic_set(&last_readahead_pages, pages);
> > +	pages = READ_ONCE(prev_pages) + 1;
> > +	if (pages > max_pages) {
> > +		WRITE_ONCE(prev_pages, 0);
> > +		pages = max_pages;
> > +	} else
> > +		WRITE_ONCE(prev_pages, pages);
> >  
> >  	return pages;
> >  }
> > @@ -684,8 +643,7 @@ struct page *swap_readahead_detect(struct vm_fault *vmf,
> >  	pfn = PFN_DOWN(SWAP_RA_ADDR(swap_ra_info));
> >  	prev_win = SWAP_RA_WIN(swap_ra_info);
> >  	hits = SWAP_RA_HITS(swap_ra_info);
> > -	swap_ra->win = win = __swapin_nr_pages(pfn, fpfn, hits,
> > -					       max_win, prev_win);
> > +	swap_ra->win = win = swapin_nr_pages(fpfn);
> >  	atomic_long_set(&vma->swap_readahead_info,
> >  			SWAP_RA_VAL(faddr, win, 0));
> >  
> > -- 
> > 2.17.1
> > 
> 
> Something is not quite right in swapin_nr_pages(). The use of *_ONCE() suggests
> that swapin_nr_pages() can be executed concurrently, or in parallel on different
> CPUs, and in either case there are major synchronization issues.
> 
> In the case of this function running concurrently, reads and writes of
> `prev_pages` can be interleaved, which is probably not desired. In the case of
> this function running in parallel on different CPUs, the updated value in
> `prev_pages` on one CPU will not be reflected on another CPU due to a lack of
> explicit memory barriers to guarantee multicopy atomicity. The atomic ops that
> imply a memory barrier, guaranteeing multicopy atomicity, are the ones which
> return a value (like atomic_xchg() and atomic_cmpxchg()); the others (like
> atomic_read() and atomic_set()) do not.
> 
> If swapin_nr_pages() never executes concurrently or in parallel, then this patch
> is safe as-is and the use of *_ONCE() should be removed. Otherwise, the body of
> swapin_nr_pages() should either be converted into an atomic_cmpxchg() loop, or a
> spin lock should be used.
> 
> Nacked-by: Sultan Alsawaf <sultan.alsawaf at canonical.com>

Forgot to mention that the *_ONCE() macros do not imply memory barriers, so on
their own they are not enough for multicopy atomicity. They are great for deterring
compiler mischief where the compiler might emit multiple reads or writes when only
one is desired, but I don't think that use case applies here...
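
For the record, here is a rough, untested sketch of the kind of atomic_cmpxchg()
loop I have in mind, keeping the linear-increment heuristic from the patch but
turning prev_pages into an atomic_t so the read-modify-write can't be torn or
lost between CPUs:

static unsigned long swapin_nr_pages(unsigned long offset)
{
	static atomic_t prev_pages = ATOMIC_INIT(0);
	unsigned long pages, max_pages;
	int old, cur, next;

	max_pages = 1 << READ_ONCE(page_cluster);
	if (max_pages <= 1)
		return 1;

	old = atomic_read(&prev_pages);
	for (;;) {
		/* Same heuristic as the patch: grow the window by one page */
		pages = old + 1;
		if (pages > max_pages) {
			/* Hit the ceiling: return the max and restart next time */
			pages = max_pages;
			next = 0;
		} else {
			next = pages;
		}
		/*
		 * atomic_cmpxchg() makes the read-modify-write atomic and, on
		 * success, implies a full memory barrier, so concurrent callers
		 * can't lose increments the way READ_ONCE()/WRITE_ONCE() can.
		 */
		cur = atomic_cmpxchg(&prev_pages, old, next);
		if (cur == old)
			break;
		old = cur;	/* lost the race; retry with the fresh value */
	}

	return pages;
}

Whether the cmpxchg loop or a spinlock around the existing body is preferable is
mostly a matter of taste; either one would close the race.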

Sultan


