APPLIED[B/F/G]: [SRU] [B][F][G][H] [PATCH 0/1] bcache: consider the fragmentation when update the writeback rate
Kelsey Skunberg
kelsey.skunberg at canonical.com
Fri Apr 2 22:39:11 UTC 2021
applied to B/F/G master-next. thank you!
-Kelsey
On 2021-03-26 11:10:18 , Dongdong Tao wrote:
> From: dongdong tao <dongdong.tao at canonical.com>
>
> BugLink: https://bugs.launchpad.net/bugs/1900438
>
> SRU Justification:
>
> [Impact]
>
> This bug in bcache affects I/O performance on all kernel versions.
> It is particularly harmful for Ceph when Ceph is deployed on top of bcache.
>
> When this issue is hit, write I/O latency suddenly jumps from around 10 ms
> to around 1 second and can easily stay there for hours or even days.
> This is especially bad for a Ceph-on-bcache architecture: it makes Ceph
> extremely slow and the entire cloud almost unusable.
>
> The root cause is that the dirty buckets had reached the 70 percent
> threshold, causing all writes to go directly to the backing HDD device.
> That might be fine if the cache actually held a lot of dirty data, but here
> it happens while dirty data has not even reached 10 percent, because of
> high bucket fragmentation. What makes it worse is that the writeback rate
> may still be at the minimum value (8) because writeback_percent has not
> been reached, so it takes ages for bcache to reclaim enough dirty buckets
> to get itself out of this situation.
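>
> As a rough illustration of that cutoff behaviour (not the kernel code
> itself; the 70 percent value corresponds to bcache's CUTOFF_WRITEBACK_SYNC,
> and the helper name is made up for this sketch):
>
>     #define CUTOFF_WRITEBACK_SYNC 70   /* percent of buckets in use */
>
>     /*
>      * Illustration only: once bucket usage crosses the cutoff, new writes
>      * are no longer cached and go straight to the backing HDD.
>      */
>     static int should_cache_write(unsigned int dirty_buckets_percent)
>     {
>             return dirty_buckets_percent <= CUTOFF_WRITEBACK_SYNC;
>     }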
>
> [Fix]
>
> * 71dda2a5625f31bc3410cb69c3d31376a2b66f28 “bcache: consider the fragmentation when update the writeback rate”
>
> The current way of calculating the writeback rate only considers the dirty
> sectors. This usually works fine when fragmentation is not high, but it
> gives an unreasonably low writeback rate when a few dirty sectors have
> consumed a lot of dirty buckets. In some cases the dirty buckets reach
> CUTOFF_WRITEBACK_SYNC (at which point writes stop being cached) while the
> dirty data (sectors) has not even reached the writeback_percent threshold
> (at which point writeback starts). In that situation the writeback rate
> will still be the minimum value (8 * 512 B = 4 KB/s), so the slow writeback
> causes all writes to be stuck in a non-writeback mode.
>
> We accelerate the rate in 3 stages with different aggressiveness:
> the first stage starts when dirty buckets percent reach above BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50),
> the second is BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57),
> the third is BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64).
>
> By default the first stage tries to write back the amount of dirty data
> in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) seconds,
> the second stage tries to write back the amount of dirty data in one bucket
> in (1 / (dirty_buckets_percent - 57)) * 100 milliseconds, and the third
> stage tries to write back the amount of dirty data in one bucket in
> (1 / (dirty_buckets_percent - 64)) milliseconds.
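>
> To make the three stages concrete, here is a rough sketch in plain C of
> how the extra proportional term could be picked (illustrative only, not
> the patch itself; dirty_buckets_percent and fp_term_low/mid/high are names
> chosen here to mirror the description above):
>
>     #define BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW  50
>     #define BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID  57
>     #define BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH 64
>
>     /*
>      * Sketch: the further the dirty bucket percentage is above a threshold,
>      * the larger the extra term; fp_term_low/mid/high default to 1, 10 and
>      * 1000 as described below.
>      */
>     static long fragment_term(unsigned int dirty_buckets_percent,
>                               long fp_term_low, long fp_term_mid,
>                               long fp_term_high)
>     {
>             if (dirty_buckets_percent <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW)
>                     return 0;   /* no fragmentation boost needed yet */
>
>             if (dirty_buckets_percent <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID)
>                     return fp_term_low * (dirty_buckets_percent -
>                                           BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW);
>
>             if (dirty_buckets_percent <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH)
>                     return fp_term_mid * (dirty_buckets_percent -
>                                           BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID);
>
>             return fp_term_high * (dirty_buckets_percent -
>                                    BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH);
>     }
>
> The resulting rate would then be roughly this term multiplied by the
> average amount of dirty data per bucket (bucket_size / fragment), which is
> where the initial-rate figures in A-C below come from.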
>
> The initial rate at each stage can be controlled by 3 configurable
> parameters:
>
> writeback_rate_fp_term_{low|mid|high}
>
> By default they are 1, 10 and 1000, chosen based on testing and production
> data, as detailed below.
>
> A. In the low stage we are still far from the 70% threshold, so we only
> want to give the rate a small push by setting the term to 1. This means an
> initial rate of about 170 if the fragment is 6 (calculated as
> bucket_size/fragment); that rate is very small, but still much more
> reasonable than the minimum of 8.
> For a production bcache with a non-heavy workload, if the cache device is
> bigger than 1 TB, it may take hours to consume 1% of the buckets, so it is
> very likely that enough dirty buckets will be reclaimed in this stage,
> avoiding the next stage altogether.
>
> B. If the dirty bucket ratio does not turn around during the low stage,
> we enter the mid stage. The mid stage needs to be more aggressive than the
> low stage, so its initial rate is chosen to be 10 times higher, which means
> an initial rate of about 1700 if the fragment is 6. This is a normal rate
> we usually see for a typical workload when writeback is triggered by
> writeback_percent.
>
> C. If the dirty bucket ratio does not turn around during the low and mid
> stages, we enter the third stage, which is the last chance to turn things
> around and avoid the cutoff writeback sync issue. Here we choose a rate
> 100 times more aggressive than the mid stage, which means an initial rate
> of about 170000 if the fragment is 6. This is also inferred from a
> production bcache: I collected one week of writeback rate data from a
> production bcache with quite heavy workloads (again, with writeback
> triggered by writeback_percent), and the highest rates were around
> 100000 to 240000, so I believe this kind of aggressiveness is reasonable
> for production. It should also be more than enough, because at this rate
> the third stage would reclaim about 1000 buckets per second, while that
> heavy production environment consumed only about 50 buckets per second on
> average over the week.
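>
> As a sanity check on the 170 / 1700 / 170000 figures above, here is a small
> worked example, assuming the default 512 KiB bucket (1024 sectors), a
> fragment of 6 and the default terms of 1, 10 and 1000 (all assumed inputs,
> not measured values):
>
>     #include <stdio.h>
>
>     int main(void)
>     {
>             const long bucket_size = 1024;          /* sectors, assumed default */
>             const long fragment = 6;
>             const long fp_term[] = { 1, 10, 1000 }; /* low, mid, high defaults */
>             const char *stage[] = { "low", "mid", "high" };
>
>             for (int i = 0; i < 3; i++)
>                     printf("%-4s stage initial rate ~= %ld sectors/s\n",
>                            stage[i], fp_term[i] * bucket_size / fragment);
>
>             /* Prints roughly 170, 1706 and 170666, matching the figures
>              * quoted in A-C above. */
>             return 0;
>     }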
>
> The writeback_consider_fragment option controls whether this feature is on
> or off; it is on by default.
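>
> For completeness, a small user-space sketch of how these knobs could be
> adjusted once the patch is applied. The attribute names come from the
> patch; the /sys/block/bcache0/bcache/ path and the value 20 are examples
> only, not recommendations:
>
>     #include <stdio.h>
>
>     /* Write a value to a bcache backing-device sysfs attribute. */
>     static int set_bcache_attr(const char *attr, const char *value)
>     {
>             char path[256];
>             FILE *f;
>
>             snprintf(path, sizeof(path),
>                      "/sys/block/bcache0/bcache/%s", attr);
>             f = fopen(path, "w");
>             if (!f)
>                     return -1;
>             fprintf(f, "%s\n", value);
>             fclose(f);
>             return 0;
>     }
>
>     int main(void)
>     {
>             set_bcache_attr("writeback_consider_fragment", "1"); /* on (default) */
>             set_bcache_attr("writeback_rate_fp_term_mid", "20"); /* example tweak */
>             return 0;
>     }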
>
>
> [Test Plan]
>
> I have put all my testing results in the Google document below;
> the testing clearly shows a significant performance improvement.
> https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing
>
> As another test, we built a kernel based on bionic 4.15.0-99.100 plus the
> patch and deployed it in a production environment, an OpenStack cloud with
> Ceph on bcache as the storage. It has been running for more than one month
> without showing any issues.
>
> [Where problems could occur]
>
> The patch only updates the writeback rate, so it has no impact on data
> safety. The only potential regression I can think of is that the backing
> device might be a bit busier once the dirty buckets reach
> BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50% by default), since the writeback
> rate is accelerated in this highly fragmented situation; but that is
> exactly because we are trying to prevent all writes from hitting the
> writeback cutoff sync threshold.
>
> [Other Info]
>
> This SRU covers the Ubuntu B, F, G and H releases, with one patch for each
> of them.
>
> dongdong tao (1):
> bcache: consider the fragmentation when update the writeback rate
>
> drivers/md/bcache/bcache.h | 4 ++++
> drivers/md/bcache/sysfs.c | 23 +++++++++++++++++++
> drivers/md/bcache/writeback.c | 42 +++++++++++++++++++++++++++++++++++
> drivers/md/bcache/writeback.h | 5 +++++
> 4 files changed, 74 insertions(+)
>
> --
> 2.17.1
>
>
> --
> kernel-team mailing list
> kernel-team at lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team