APPLIED[B/F/G]: [SRU] [B][F][G][H] [PATCH 0/1] bcache: consider the fragmentation when update the writeback rate
Kelsey Skunberg
kelsey.skunberg at canonical.com
Fri Apr 2 22:39:11 UTC 2021
applied to B/F/G master-next. thank you!
-Kelsey
On 2021-03-26 11:10:18 , Dongdong Tao wrote:
> From: dongdong tao <dongdong.tao at canonical.com>
>
> BugLink: https://bugs.launchpad.net/bugs/1900438
>
> SRU Justification:
>
> [Impact]
>
> This bug in bcache affects I/O performance on all kernel versions.
> It is particularly harmful for Ceph when Ceph is deployed on top of bcache.
>
> When this issue is hit, write I/O latency suddenly jumps from around 10 ms
> to around 1 second and can easily stay there for hours or even days.
> This is especially bad for a Ceph-on-bcache architecture: it makes Ceph
> extremely slow and the entire cloud almost unusable.
>
> The root cause is that the dirty buckets had reached the 70 percent
> threshold, causing all writes to go directly to the backing HDD device.
> That might be fine if the cache actually held a lot of dirty data, but here
> it happens while dirty data has not even reached 10 percent, because of
> high bucket fragmentation. What makes it worse is that the writeback rate
> may still be at the minimum value (8) because writeback_percent has not
> been reached, so it takes ages for bcache to reclaim enough dirty buckets
> to get itself out of this situation.
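>
> As a rough illustration of that cutoff behaviour (not the kernel code
> itself; the 70 percent value corresponds to bcache's CUTOFF_WRITEBACK_SYNC,
> and the helper name is made up for this sketch):
>
>     #define CUTOFF_WRITEBACK_SYNC 70   /* percent of buckets in use */
>
>     /*
>      * Illustration only: once bucket usage crosses the cutoff, new writes
>      * are no longer cached and go straight to the backing HDD.
>      */
>     static int should_cache_write(unsigned int dirty_buckets_percent)
>     {
>             return dirty_buckets_percent <= CUTOFF_WRITEBACK_SYNC;
>     }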
>
> [Fix]
>
> * 71dda2a5625f31bc3410cb69c3d31376a2b66f28 “bcache: consider the fragmentation when update the writeback rate”
>
> The current way of calculating the writeback rate only considers the dirty
> sectors. This usually works fine when fragmentation is not high, but it
> gives an unreasonably low writeback rate when a few dirty sectors have
> consumed a lot of dirty buckets. In some cases the dirty buckets reach
> CUTOFF_WRITEBACK_SYNC (at which point writes stop being cached) while the
> dirty data (sectors) has not even reached the writeback_percent threshold
> (at which point writeback starts). In that situation the writeback rate
> will still be the minimum value (8 * 512 B = 4 KB/s), so the slow writeback
> causes all writes to be stuck in a non-writeback mode.
>
> We accelerate the rate in 3 stages with different aggressiveness:
> the first stage starts when dirty buckets percent reach above BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50),
> the second is BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57),
> the third is BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64).
>
> By default the first stage tries to write back the amount of dirty data
> in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) seconds,
> the second stage tries to write back the amount of dirty data in one bucket
> in (1 / (dirty_buckets_percent - 57)) * 100 milliseconds, and the third
> stage tries to write back the amount of dirty data in one bucket in
> (1 / (dirty_buckets_percent - 64)) milliseconds.
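>
> To make the three stages concrete, here is a rough sketch in plain C of
> how the extra proportional term could be picked (illustrative only, not
> the patch itself; dirty_buckets_percent and fp_term_low/mid/high are names
> chosen here to mirror the description above):
>
>     #define BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW  50
>     #define BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID  57
>     #define BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH 64
>
>     /*
>      * Sketch: the further the dirty bucket percentage is above a threshold,
>      * the larger the extra term; fp_term_low/mid/high default to 1, 10 and
>      * 1000 as described below.
>      */
>     static long fragment_term(unsigned int dirty_buckets_percent,
>                               long fp_term_low, long fp_term_mid,
>                               long fp_term_high)
>     {
>             if (dirty_buckets_percent <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW)
>                     return 0;   /* no fragmentation boost needed yet */
>
>             if (dirty_buckets_percent <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID)
>                     return fp_term_low * (dirty_buckets_percent -
>                                           BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW);
>
>             if (dirty_buckets_percent <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH)
>                     return fp_term_mid * (dirty_buckets_percent -
>                                           BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID);
>
>             return fp_term_high * (dirty_buckets_percent -
>                                    BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH);
>     }
>
> The resulting rate would then be roughly this term multiplied by the
> average amount of dirty data per bucket (bucket_size / fragment), which is
> where the initial-rate figures in A-C below come from.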
>
> The initial rate at each stage can be controlled by 3 configurable
> parameters:
>
> writeback_rate_fp_term_{low|mid|high}
>
> By default they are 1, 10 and 1000, chosen based on testing and production
> data, as detailed below.
>
> A. In the low stage we are still far from the 70% threshold, so we only
> want to give the rate a small push by setting the term to 1. This means an
> initial rate of about 170 if the fragment is 6 (calculated as
> bucket_size/fragment); that rate is very small, but still much more
> reasonable than the minimum of 8.
> For a production bcache with a non-heavy workload, if the cache device is
> bigger than 1 TB, it may take hours to consume 1% of the buckets, so it is
> very likely that enough dirty buckets will be reclaimed in this stage,
> avoiding the next stage altogether.
>
> B. If the dirty bucket ratio does not turn around during the low stage,
> we enter the mid stage. The mid stage needs to be more aggressive than the
> low stage, so its initial rate is chosen to be 10 times higher, which means
> an initial rate of about 1700 if the fragment is 6. This is a normal rate
> we usually see for a typical workload when writeback is triggered by
> writeback_percent.
>
> C. If the dirty bucket ratio does not turn around during the low and mid
> stages, we enter the third stage, which is the last chance to turn things
> around and avoid the cutoff writeback sync issue. Here we choose a rate
> 100 times more aggressive than the mid stage, which means an initial rate
> of about 170000 if the fragment is 6. This is also inferred from a
> production bcache: I collected one week of writeback rate data from a
> production bcache with quite heavy workloads (again, with writeback
> triggered by writeback_percent), and the highest rates were around
> 100000 to 240000, so I believe this kind of aggressiveness is reasonable
> for production. It should also be more than enough, because at this rate
> the third stage would reclaim about 1000 buckets per second, while that
> heavy production environment consumed only about 50 buckets per second on
> average over the week.
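>
> As a sanity check on the 170 / 1700 / 170000 figures above, here is a small
> worked example, assuming the default 512 KiB bucket (1024 sectors), a
> fragment of 6 and the default terms of 1, 10 and 1000 (all assumed inputs,
> not measured values):
>
>     #include <stdio.h>
>
>     int main(void)
>     {
>             const long bucket_size = 1024;          /* sectors, assumed default */
>             const long fragment = 6;
>             const long fp_term[] = { 1, 10, 1000 }; /* low, mid, high defaults */
>             const char *stage[] = { "low", "mid", "high" };
>
>             for (int i = 0; i < 3; i++)
>                     printf("%-4s stage initial rate ~= %ld sectors/s\n",
>                            stage[i], fp_term[i] * bucket_size / fragment);
>
>             /* Prints roughly 170, 1706 and 170666, matching the figures
>              * quoted in A-C above. */
>             return 0;
>     }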
>
> The writeback_consider_fragment option controls whether this feature is on
> or off; it is on by default.
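>
> For completeness, a small user-space sketch of how these knobs could be
> adjusted once the patch is applied. The attribute names come from the
> patch; the /sys/block/bcache0/bcache/ path and the value 20 are examples
> only, not recommendations:
>
>     #include <stdio.h>
>
>     /* Write a value to a bcache backing-device sysfs attribute. */
>     static int set_bcache_attr(const char *attr, const char *value)
>     {
>             char path[256];
>             FILE *f;
>
>             snprintf(path, sizeof(path),
>                      "/sys/block/bcache0/bcache/%s", attr);
>             f = fopen(path, "w");
>             if (!f)
>                     return -1;
>             fprintf(f, "%s\n", value);
>             fclose(f);
>             return 0;
>     }
>
>     int main(void)
>     {
>             set_bcache_attr("writeback_consider_fragment", "1"); /* on (default) */
>             set_bcache_attr("writeback_rate_fp_term_mid", "20"); /* example tweak */
>             return 0;
>     }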
>
>
> [Test Plan]
>
> I have put all my testing results in the Google document below;
> the testing clearly shows a significant performance improvement.
> https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing
>
> As another test, we built a kernel based on bionic 4.15.0-99.100 plus the
> patch and deployed it in a production environment, an OpenStack cloud with
> Ceph on bcache as the storage. It has been running for more than one month
> without showing any issues.
>
> [Where problems could occur]
>
> The patch only updates the writeback rate, so it has no impact on data
> safety. The only potential regression I can think of is that the backing
> device might be a bit busier once the dirty buckets reach
> BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50% by default), since the writeback
> rate is accelerated in this highly fragmented situation; but that is
> exactly because we are trying to prevent all writes from hitting the
> writeback cutoff sync threshold.
>
> [Other Info]
>
> This SRU covers the Ubuntu B, F, G and H releases, with one patch for each
> of them.
>
> dongdong tao (1):
> bcache: consider the fragmentation when update the writeback rate
>
> drivers/md/bcache/bcache.h | 4 ++++
> drivers/md/bcache/sysfs.c | 23 +++++++++++++++++++
> drivers/md/bcache/writeback.c | 42 +++++++++++++++++++++++++++++++++++
> drivers/md/bcache/writeback.h | 5 +++++
> 4 files changed, 74 insertions(+)
>
> --
> 2.17.1
>
>
> --
> kernel-team mailing list
> kernel-team at lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team