NACK: [PATCH][B][aws] UBUNTU SAUCE: mm: swap: improve swap readahead heuristic

Andrea Righi andrea.righi at canonical.com
Thu Dec 12 07:21:53 UTC 2019


On Wed, Dec 11, 2019 at 01:06:16PM -0800, Sultan Alsawaf wrote:
> On Wed, Dec 11, 2019 at 01:00:02PM -0800, Sultan Alsawaf wrote:
> > On Tue, Dec 03, 2019 at 11:58:59AM +0100, Andrea Righi wrote:
> > > BugLink: https://bugs.launchpad.net/bugs/1831940
> > > 
> > > Apply a more aggressive swapin readahead policy to improve swapoff
> > > performance.
> > > 
> > > The idea is to start with no readahead (only read one page) and linearly
> > > increment the amount of readahead pages each time swapin_readahead() is
> > > called, up to the maximum cluster size (defined by vm.page-cluster),
> > > then go back to one page to give the disk enough time to prefetch the
> > > requested pages and avoid re-requesting them multiple times.
> > > 
> > > Also increase the default vm.page-cluster size to 8 (that seems to work
> > > better with this new heuristic).
> > > 
> > > Signed-off-by: Andrea Righi <andrea.righi at canonical.com>
> > > ---
> > >  mm/swap.c       |  2 +-
> > >  mm/swap_state.c | 60 ++++++++-----------------------------------------
> > >  2 files changed, 10 insertions(+), 52 deletions(-)
> > > 
> > > diff --git a/mm/swap.c b/mm/swap.c
> > > index abc82e6c14d1..5603bc987ef0 100644
> > > --- a/mm/swap.c
> > > +++ b/mm/swap.c
> > > @@ -1022,7 +1022,7 @@ void __init swap_setup(void)
> > >  	if (megs < 16)
> > >  		page_cluster = 2;
> > >  	else
> > > -		page_cluster = 3;
> > > +		page_cluster = 8;
> > >  	/*
> > >  	 * Right now other parts of the system means that we
> > >  	 * _really_ don't want to cluster much more
> > > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > > index 6dac8c6ee6d9..a2246bcebc77 100644
> > > --- a/mm/swap_state.c
> > > +++ b/mm/swap_state.c
> > > @@ -472,62 +472,21 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > >  	return retpage;
> > >  }
> > >  
> > > -static unsigned int __swapin_nr_pages(unsigned long prev_offset,
> > > -				      unsigned long offset,
> > > -				      int hits,
> > > -				      int max_pages,
> > > -				      int prev_win)
> > > -{
> > > -	unsigned int pages, last_ra;
> > > -
> > > -	/*
> > > -	 * This heuristic has been found to work well on both sequential and
> > > -	 * random loads, swapping to hard disk or to SSD: please don't ask
> > > -	 * what the "+ 2" means, it just happens to work well, that's all.
> > > -	 */
> > > -	pages = hits + 2;
> > > -	if (pages == 2) {
> > > -		/*
> > > -		 * We can have no readahead hits to judge by: but must not get
> > > -		 * stuck here forever, so check for an adjacent offset instead
> > > -		 * (and don't even bother to check whether swap type is same).
> > > -		 */
> > > -		if (offset != prev_offset + 1 && offset != prev_offset - 1)
> > > -			pages = 1;
> > > -	} else {
> > > -		unsigned int roundup = 4;
> > > -		while (roundup < pages)
> > > -			roundup <<= 1;
> > > -		pages = roundup;
> > > -	}
> > > -
> > > -	if (pages > max_pages)
> > > -		pages = max_pages;
> > > -
> > > -	/* Don't shrink readahead too fast */
> > > -	last_ra = prev_win / 2;
> > > -	if (pages < last_ra)
> > > -		pages = last_ra;
> > > -
> > > -	return pages;
> > > -}
> > > -
> > >  static unsigned long swapin_nr_pages(unsigned long offset)
> > >  {
> > > -	static unsigned long prev_offset;
> > > -	unsigned int hits, pages, max_pages;
> > > -	static atomic_t last_readahead_pages;
> > > +	static unsigned int prev_pages;
> > > +	unsigned long pages, max_pages;
> > >  
> > >  	max_pages = 1 << READ_ONCE(page_cluster);
> > >  	if (max_pages <= 1)
> > >  		return 1;
> > >  
> > > -	hits = atomic_xchg(&swapin_readahead_hits, 0);
> > > -	pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,
> > > -				  atomic_read(&last_readahead_pages));
> > > -	if (!hits)
> > > -		prev_offset = offset;
> > > -	atomic_set(&last_readahead_pages, pages);
> > > +	pages = READ_ONCE(prev_pages) + 1;
> > > +	if (pages > max_pages) {
> > > +		WRITE_ONCE(prev_pages, 0);
> > > +		pages = max_pages;
> > > +	} else
> > > +		WRITE_ONCE(prev_pages, pages);
> > >  
> > >  	return pages;
> > >  }
> > > @@ -684,8 +643,7 @@ struct page *swap_readahead_detect(struct vm_fault *vmf,
> > >  	pfn = PFN_DOWN(SWAP_RA_ADDR(swap_ra_info));
> > >  	prev_win = SWAP_RA_WIN(swap_ra_info);
> > >  	hits = SWAP_RA_HITS(swap_ra_info);
> > > -	swap_ra->win = win = __swapin_nr_pages(pfn, fpfn, hits,
> > > -					       max_win, prev_win);
> > > +	swap_ra->win = win = swapin_nr_pages(fpfn);
> > >  	atomic_long_set(&vma->swap_readahead_info,
> > >  			SWAP_RA_VAL(faddr, win, 0));
> > >  
> > > -- 
> > > 2.17.1
> > > 
> > > 
> > > -- 
> > > kernel-team mailing list
> > > kernel-team at lists.ubuntu.com
> > > https://lists.ubuntu.com/mailman/listinfo/kernel-team
> > 
> > Something is not quite right in swapin_nr_pages(). The use of *_ONCE()
> > suggests that swapin_nr_pages() can be executed concurrently or in parallel
> > on different CPUs, and in either case there are major synchronization
> > issues.
> > 
> > In the case of this function running concurrently, reads and writes of
> > `prev_pages` can be interleaved, which is probably not desired. In the case of
> > this function running in parallel on different CPUs, the updated value in
> > `prev_pages` on one CPU will not be reflected on another CPU due to a lack of
> > explicit memory barriers to guarantee multicopy atomicity. The atomic ops that
> > imply a memory barrier, guaranteeing multicopy atomicity, are the ones which
> > return a value (like atomic_xchg() and atomic_cmpxchg()); the others (like
> > atomic_read() and atomic_set()) do not.
> > 
> > If swapin_nr_pages() never executes concurrently or in parallel, then this patch
> > is safe as-is and the use of *_ONCE() should be removed. Otherwise, the body of
> > swapin_nr_pages() should either be converted into an atomic_cmpxchg() loop, or a
> > spin lock should be used.
> > 
> > Nacked-by: Sultan Alsawaf <sultan.alsawaf at canonical.com>
> 
> Forgot to mention that the *_ONCE() macros do not imply memory barriers. They
> alone are not enough for multicopy atomicity. They are great for deterring
> compiler mischief, where the compiler might emit multiple reads/writes when
> only one is desired, but I don't think that use case applies here...

As already mentioned on IRC (but also reporting it here for completeness),
we don't really need to be extra precise here: in the end it's just a
heuristic to determine how many pages to prefetch, and the idea is simply
to do more aggressive prefetching.
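
To give an idea of the behavior, here is a quick userspace sketch (not the
kernel code itself, just a replay of the new logic) that shows how the
readahead window ramps up with vm.page-cluster=8:

#include <stdio.h>

int main(void)
{
	unsigned long prev_pages = 0, pages, max_pages = 1 << 8;
	int fault;

	for (fault = 0; fault < 300; fault++) {
		/* same logic as the new swapin_nr_pages() */
		pages = prev_pages + 1;
		if (pages > max_pages) {
			prev_pages = 0;
			pages = max_pages;
		} else {
			prev_pages = pages;
		}
		printf("fault %3d -> readahead %lu pages\n", fault, pages);
	}
	return 0;
}

The window just grows by one page per fault up to 256 pages and then
restarts from one, so an occasional lost or interleaved update of prev_pages
only makes the ramp restart a bit earlier or later, which is harmless.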

The idea behind the *_ONCE() was simply to make sure the individual
reads/writes of prev_pages are consistent (not torn or duplicated by the
compiler), but I don't expect any surprises if we just drop them.
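
Concretely, with the *_ONCE() dropped the function would look more or less
like this (untested, just to show the direction I have in mind for v2):

static unsigned long swapin_nr_pages(unsigned long offset)
{
	static unsigned int prev_pages;
	unsigned long pages, max_pages;

	max_pages = 1 << READ_ONCE(page_cluster);
	if (max_pages <= 1)
		return 1;

	/* Grow the readahead window linearly, then restart from one page */
	pages = prev_pages + 1;
	if (pages > max_pages) {
		prev_pages = 0;
		pages = max_pages;
	} else
		prev_pages = pages;

	return pages;
}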

So, I'll repeat the tests with those removed and, if everything works as
expected, I'll send a v2.
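
For completeness, if we ever decide the update really has to be race-free,
the atomic_cmpxchg() loop you suggest would look roughly like the sketch
below (untested); I don't think it's worth it for a readahead hint, though:

static unsigned long swapin_nr_pages(unsigned long offset)
{
	static atomic_t prev_pages = ATOMIC_INIT(0);
	unsigned long max_pages;
	int old, new;

	max_pages = 1 << READ_ONCE(page_cluster);
	if (max_pages <= 1)
		return 1;

	/* Atomically advance the window: old -> old + 1, wrapping to 0 */
	do {
		old = atomic_read(&prev_pages);
		new = (old + 1 > max_pages) ? 0 : old + 1;
	} while (atomic_cmpxchg(&prev_pages, old, new) != old);

	/* A wrap (new == 0) means we just hit the maximum window */
	return new ? new : max_pages;
}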

Thanks for the review!
-Andrea


