ACK: [SRU] [B][F][G][H] [PATCH 0/1] bcache: consider the fragmentation when update the writeback rate

Tim Gardner tim.gardner at canonical.com
Fri Mar 26 12:04:30 UTC 2021


Acked-by: Tim Gardner <tim.gardner at canonical.com>

Nice work.

On 3/25/21 9:10 PM, Dongdong Tao wrote:
> From: dongdong tao <dongdong.tao at canonical.com>
> 
> BugLink: https://bugs.launchpad.net/bugs/1900438
> 
> SRU Justification:
> 
> [Impact]
> 
> This bug in bcache affects I/O performance on all versions of the
> kernel. It is particularly harmful when Ceph is used on top of bcache.
> 
> When the issue is hit, write I/O latency suddenly jumps from around
> 10 ms to around 1 second and can easily stay there for hours or even
> days. This is especially bad for a Ceph-on-bcache architecture, as it
> makes Ceph extremely slow and the entire cloud almost unusable.
> 
> The root cause is that the dirty buckets have reached the 70 percent
> threshold, causing all writes to go directly to the backing HDD
> device. That might be fine if there really were a lot of dirty data,
> but it happens even when the dirty data has not reached 10 percent,
> because the dirty data is highly fragmented across buckets. What
> makes it worse is that the writeback rate may still be at its minimum
> value (8 sectors/s) because writeback_percent has not been reached,
> so it takes ages for bcache to reclaim enough dirty buckets to get
> itself out of this situation.
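> 
> For illustration (hypothetical numbers, not taken from the bug
> report): with 512 KiB buckets that each hold only about 50 KiB of
> dirty data, 70 percent of the buckets can be dirty while the dirty
> sectors add up to only about 7 percent of the cache
> (0.70 * 50 / 512 ~= 0.068), so the bucket cutoff is reached long
> before writeback_percent is.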
> 
> [Fix]
> 
> * 71dda2a5625f31bc3410cb69c3d31376a2b66f28 “bcache: consider the
> fragmentation when update the writeback rate”
> 
> The current way of calculating the writeback rate only considers the
> dirty sectors. This usually works fine when fragmentation is not
> high, but it gives us an unreasonably low writeback rate when a few
> dirty sectors have consumed a lot of dirty buckets. In some cases the
> dirty buckets reached CUTOFF_WRITEBACK_SYNC (the 70% cutoff at which
> new writes stop being cached) while the dirty data (sectors) had not
> even reached the writeback_percent threshold (at which writeback
> would start). In that situation the writeback rate stays at the
> minimum value (8 * 512 B = 4 KB/s), so the slow writeback causes all
> writes to be stuck in a non-writeback mode.
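> 
> To make the failure mode concrete, here is a minimal userspace sketch
> (illustrative only, not the kernel code: the 4 KB/s minimum comes from
> this justification, while the definition of "fragment" and all names
> and numbers in the sketch are my own illustration):
> 
>   #include <stdint.h>
>   #include <stdio.h>
> 
>   /*
>    * "fragment" is how thinly dirty data is spread across buckets:
>    * total capacity of the dirty buckets divided by the dirty sectors.
>    * A value of 8 means each dirty bucket is on average only 1/8 full
>    * of dirty data.
>    */
>   static int64_t fragment(int64_t dirty_buckets, int64_t bucket_size_sectors,
>                           int64_t dirty_sectors)
>   {
>           return dirty_buckets * bucket_size_sectors / dirty_sectors;
>   }
> 
>   int main(void)
>   {
>           /* Hypothetical cache: 100000 buckets of 1024 sectors (512 KiB). */
>           int64_t bucket_size = 1024, nbuckets = 100000;
>           int64_t dirty_buckets = 70000;         /* 70% of buckets dirty  */
>           int64_t dirty_sectors = 70000 * 128;   /* ...each only 1/8 full */
> 
>           printf("dirty data: %.1f%% of the cache\n",
>                  100.0 * dirty_sectors / (nbuckets * bucket_size));
>           printf("fragment:   %lld\n", (long long)
>                  fragment(dirty_buckets, bucket_size, dirty_sectors));
>           printf("sector-based minimum rate: 8 sectors/s = 4 KB/s\n");
>           return 0;
>   }
> 
> With these numbers the cache sits at the 70% bucket cutoff with only
> about 8.8% dirty data, yet a controller that looks only at dirty
> sectors sees no reason to raise the rate above its minimum.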
> 
> We accelerate the rate in 3 stages with different levels of
> aggressiveness: the first stage starts when the dirty bucket
> percentage goes above BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the
> second at BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), and the third at
> BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64).
> 
> By default the first stage tries to write back the amount of dirty
> data in one bucket (on average) in (1 / (dirty_buckets_percent - 50))
> seconds, the second stage in (1 / (dirty_buckets_percent - 57)) * 100
> milliseconds, and the third stage in (1 / (dirty_buckets_percent -
> 64)) milliseconds.
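> 
> Expressed as rates (a worked restatement of the above, in the same
> sectors-per-second units as writeback_rate, where avg_dirty is the
> average dirty data per bucket and p is the dirty bucket percentage):
> 
>   stage 1:  rate = avg_dirty * (p - 50)          (one bucket per 1/(p-50) s)
>   stage 2:  rate = avg_dirty * (p - 57) * 10     (one bucket per 100/(p-57) ms)
>   stage 3:  rate = avg_dirty * (p - 64) * 1000   (one bucket per 1/(p-64) ms)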
> 
> The initial rate at each stage can be controlled by 3 configurable
> parameters:
> 
> writeback_rate_fp_term_{low|mid|high}
> 
> They default to 1, 10 and 1000 respectively, chosen based on testing
> and production data as detailed below.
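> 
> A minimal sketch of how the stage selection and the term parameters
> combine (illustrative userspace code, not the actual patch; the
> thresholds 50/57/64 and the default terms 1/10/1000 come from this
> justification, while the function and variable names are mine):
> 
>   #include <stdint.h>
> 
>   #define FRAG_LOW   50
>   #define FRAG_MID   57
>   #define FRAG_HIGH  64
> 
>   /* Defaults of writeback_rate_fp_term_{low,mid,high}. */
>   static const int64_t fp_term_low = 1, fp_term_mid = 10, fp_term_high = 1000;
> 
>   /*
>    * Staged boost: returns a target rate in sectors per second from the
>    * dirty bucket percentage and the average dirty sectors per dirty
>    * bucket.  At or below the low threshold no boost is applied and the
>    * normal controller output is kept.
>    */
>   static int64_t fragment_boost(int64_t dirty_bucket_percent,
>                                 int64_t avg_dirty_per_bucket)
>   {
>           int64_t fp_term;
> 
>           if (dirty_bucket_percent <= FRAG_LOW)
>                   return 0;                /* no boost */
>           else if (dirty_bucket_percent <= FRAG_MID)
>                   fp_term = fp_term_low  * (dirty_bucket_percent - FRAG_LOW);
>           else if (dirty_bucket_percent <= FRAG_HIGH)
>                   fp_term = fp_term_mid  * (dirty_bucket_percent - FRAG_MID);
>           else
>                   fp_term = fp_term_high * (dirty_bucket_percent - FRAG_HIGH);
> 
>           /* e.g. 170 at 51%, 1700 at 58%, 170000 at 65% when avg_dirty = 170 */
>           return avg_dirty_per_bucket * fp_term;
>   }
> 
> The terms can be tuned at runtime through the
> writeback_rate_fp_term_{low|mid|high} sysfs attributes of the cached
> device (see the sysfs.c hunk in the diffstat), and the whole mechanism
> can be switched off with writeback_consider_fragment, described below.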
> 
> A. When we enter the low stage we are still far from the 70%
> threshold, so we only want to give the rate a little push by setting
> the term to 1. The initial rate is calculated as bucket_size/fragment,
> so with a 1024-sector bucket and a fragment of 6 the initial rate is
> about 170; this is very small, but still much more reasonable than
> the minimum of 8. For a production bcache with a non-heavy workload,
> if the cache device is bigger than 1 TB, it may take hours to consume
> 1% of the buckets, so it is very possible to reclaim enough dirty
> buckets in this stage and avoid entering the next one.
> 
> B. If the dirty bucket ratio did not turn around during the first
> stage, we enter the mid stage. The mid stage needs to be more
> aggressive than the low stage, so its initial rate is chosen to be 10
> times that of the low stage, which means an initial rate of 1700 if
> the fragment is 6. This is a normal rate we usually see for a normal
> workload when writeback is triggered by writeback_percent.
> 
> C. If the dirty bucket ratio did not turn around during the low and
> mid stages, we enter the third stage, which is the last chance to
> turn things around and avoid the horrible cutoff writeback sync
> issue. Here we choose to be 100 times more aggressive than the mid
> stage, which means an initial rate of 170000 if the fragment is 6.
> This is also inferred from production: I collected one week of
> writeback rate data from a production bcache with quite heavy
> workloads where, again, writeback was triggered by writeback_percent,
> and the highest rates were around 100000 to 240000, so I believe this
> level of aggressiveness is reasonable for production. It should also
> be more than enough, because at this stage the hint is to reclaim
> roughly 1000 buckets per second, while that heavy production
> environment consumed only about 50 buckets per second on average over
> the week.
> 
> The writeback_consider_fragment option controls whether this feature
> is on or off; it is on by default.
> 
> 
> [Test Plan]
> 
> I have put all my testing results in the Google document below; the
> testing clearly shows a significant performance improvement.
> https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing
>
> As a further test, we built a test kernel based on bionic
> 4.15.0-99.100 plus the patch and put it into a production
> environment: an OpenStack deployment with Ceph on bcache as the
> storage. It has been running for more than one month without showing
> any issues.
> 
> [Where problems could occur]
> 
> The patch only updates the writeback rate, so it will not have any
> impact on data safety. The only potential regression I can think of
> is that the backing device might be a bit busier once the dirty
> buckets reach BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50% by default),
> since the writeback rate is accelerated in this highly fragmented
> situation; but that is precisely because we are trying to prevent all
> writes from hitting the writeback cutoff sync threshold.
> 
> [Other Info]
> 
> This SRU covers the Ubuntu B, F, G and H releases, with one patch for
> each of them.
> 
> dongdong tao (1):
>   bcache: consider the fragmentation when update the writeback rate
> 
> drivers/md/bcache/bcache.h    |  4 ++++
> drivers/md/bcache/sysfs.c     | 23 +++++++++++++++++++
> drivers/md/bcache/writeback.c | 42 +++++++++++++++++++++++++++++++++++
> drivers/md/bcache/writeback.h |  5 +++++
> 4 files changed, 74 insertions(+)
> 

-- 
-----------
Tim Gardner
Canonical, Inc


