ACK/Cmnt: [SRU][F][PATCH 0/3] bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

Stefan Bader stefan.bader at canonical.com
Wed Oct 21 07:54:44 UTC 2020


On 21.10.20 03:58, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1898786
> 
> [Impact]
> 
> Systems that utilise bcache can experience extremely high IO wait times when
> under constant IO pressure. The IO wait times seem to stay at a consistent 1
> second, and never drop as long as the bcache shrinker is enabled.
> 
> If you disable the shrinker, then IO wait drops significantly to normal levels.
> 
> We did some perf analysis, and it seems we spend a huge amount of time in
> bch_mca_scan(), likely waiting for the mutex "&c->bucket_lock".
> 
> Looking at the recent commits in Bionic, we found the following commit merged in
> upstream 5.1-rc1 and backported to 4.15.0-87-generic through upstream stable:
> 
> commit 9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b
> Author: Coly Li <colyli at suse.de>
> Date: Wed Nov 13 16:03:24 2019 +0800
> Subject: bcache: at least try to shrink 1 node in bch_mca_scan()
> Link: https://github.com/torvalds/linux/commit/9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b 
> 
> It mentions in the description that:
> 
>> If sc->nr_to_scan is smaller than c->btree_pages, after the above
>> calculation, variable 'nr' will be 0 and nothing will be shrunk. It is
>> frequeently observed that only 1 or 2 is set to sc->nr_to_scan and make
>> nr to be zero. Then bch_mca_scan() will do nothing more then acquiring
>> and releasing mutex c->bucket_lock. 
> 
> This seems to be what is going on here, but the above commit only addresses when
> nr is 0.
> 
> From what I can see, the problems we are experiencing are when nr is 1 or 2, and
> again, we just waste time in bch_mca_scan() waiting on c->bucket_lock, only to
> release c->bucket_lock since the shrinker loop never executes since there is no
> work to do.
> 
> [Fix]
> 
> The following commits fix the problem, and all landed in 5.6-rc1:
> 
> commit 125d98edd11464c8e0ec9eaaba7d682d0f832686
> Author: Coly Li <colyli at suse.de>
> Date: Fri Jan 24 01:01:40 2020 +0800
> Subject: bcache: remove member accessed from struct btree
> Link: https://github.com/torvalds/linux/commit/125d98edd11464c8e0ec9eaaba7d682d0f832686
> 
> commit d5c9c470b01177e4d90cdbf178b8c7f37f5b8795
> Author: Coly Li <colyli at suse.de>
> Date: Fri Jan 24 01:01:41 2020 +0800
> Subject: bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()
> Link: https://github.com/torvalds/linux/commit/d5c9c470b01177e4d90cdbf178b8c7f37f5b8795
> 
> commit e3de04469a49ee09c89e80bf821508df458ccee6
> Author: Coly Li <colyli at suse.de>
> Date: Fri Jan 24 01:01:42 2020 +0800
> Subject: bcache: reap from tail of c->btree_cache in bch_mca_scan()
> Link: https://github.com/torvalds/linux/commit/e3de04469a49ee09c89e80bf821508df458ccee6 
> 
> The first commit is a dependency of the other two. The first commit removes a
> "recently accessed" marker, used to indicate if a particular cache has been used
> recently, and if it has been, not consider it for cache eviction. The commit
> mentions that under heavy IO, all caches will end up being recently accessed,
> and nothing will ever be shrunk.
> 
> The second commit changes a previous design decision of skipping the first
> 3 caches to shrink, since it is a common case to call bch_mca_scan() with nr
> being 1, or 2, just like 0 was common in the very first commit I mentioned.
> This time, if we land on 1 or 2, the loop exits and nothing happens, and we
> waste time waiting on locks, just like the very first commit I mentioned.
> The fix is to try shrink caches from the tail of the list, and not the beginning.
> 
> The third commit fixes a minor issue where we don't want to re-arrange the linked
> list c->btree_cache, which is what the second commit ended up doing, and instead, 
> just shrink the cache at the end of the linked list, and not change the order.
> 
> One minor backport / context adjustment was required in the first commit for
> Bionic, and the rest are all clean cherry picks to Bionic and Focal.
> 
> [Testcase]
> 
> This is kind of hard to test, since the problem shows up in production
> environments that are under constant IO pressure, with many different items 
> entering and leaving the cache.
> 
> The Launchpad git server is currently suffering this issue, and has been sitting
> at a constant IO wait of 1 second / slightly over 1 second which was causing
> slow response times, which was leading to build failures when git clones ended
> up timing out.
> 
> We installed a test kernel, which is available in the following PPA:
> 
> https://launchpad.net/~mruffell/+archive/ubuntu/sf294907-test 
> 
> Once the test kernel had been installed, IO wait times with the shrinker enabled
> dropped to normal levels, and the git server became responsive again. We have
> been monitoring the performance of the git server and watching IO wait times
> in grafana over the past week, and everything is looking good, and indicate that
> these patches solve the issue.
> 
> [Regression Potential]
> 
> If a regression were to occur, it would only affect users of bcache who have the
> shrinker enabled, which I believe is the default setting in bcache deployments.
> 
> In that case, users could disable the shrinker as a workaround until a fix is
> available. You can disable the shrinker with: 
> "sudo bash -c "echo 1 > /sys/fs/bcache/<uuid>/internal/btree_shrinker_disabled""
> 
> The patches remove the "recently accessed" marker, which changes the behaviour
> slightly when it comes to deciding what to evict from the cache. Some workloads
> that aren't under IO pressure may see a difference in what items are evicted
> from the cache, which may slightly change bcache's performance profile. 
> 
> Since the patches prune the cache from the tail of the cache linked list, we
> will likely end up pruning more nodes than previously, since it was common to
> land on the first 3 caches, which were hard coded to be ignored and not pruned,
> which may also slightly change the bcache's performance profile.
> 
> Coly Li (3):
>   bcache: remove member accessed from struct btree
>   bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()
>   bcache: reap from tail of c->btree_cache in bch_mca_scan()
> 
>  drivers/md/bcache/btree.c | 24 ++++++++++--------------
>  drivers/md/bcache/btree.h |  2 --
>  2 files changed, 10 insertions(+), 16 deletions(-)
> 

Testing looks good and cherry picks/backports. Formally, for future submissions,
I would suggest to always group multi-series submissions together. In this case
this would have made the list two patches shorter and as a reviewer one does not
need to look twice. Things can be prepared with a bit of creative file naming
and adjusted subjects. For the subjects:

[PATCH 1/3 B] ...
[PATCH 1/3 F] ...
[PATCH 2/3 B/F] ...
[PATCH 3/3 B/F] ...

But this time,

Acked-by: Stefan Bader <stefan.bader at canonical.com>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: OpenPGP digital signature
URL: <https://lists.ubuntu.com/archives/kernel-team/attachments/20201021/e63cd218/attachment.sig>


More information about the kernel-team mailing list