APPLIED: [SRU][B][PATCH 0/6] btrfs: Automatic balance returns -EUCLEAN and leads to forced readonly filesystem

Fri Jul 16 15:37:01 UTC 2021

On 07.07.21 06:40, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1934709
> 
> [Impact]
> 
> During an automatic balance, users may encounter an error when writing the
> transaction log to disk, when the log tree is being parsed, which forces the
> filesystem to be remounted read-only and outputs the following kernel oops:
> 
> BTRFS: error (device dm-14) in balance_level:1962: errno=-117 unknown
> BTRFS info (device dm-14): forced readonly
> BTRFS: Transaction aborted (error -117)
> WARNING: CPU: 7 PID: 10818 at /build/linux-99Rib2/linux-4.15.0/fs/btrfs/tree-log.c:2908 btrfs_sync_log+0xa28/0xbc0 [btrfs]
> CPU: 7 PID: 10818 Comm: qemu-system-s39 Tainted: G           OE    4.15.0-136-generic #140-Ubuntu
> Hardware name: IBM 3907 LR1 A00 (LPAR)
> Krnl PSW : 0000000076bc1d64 000000009cc65255 (btrfs_sync_log+0xa28/0xbc0 [btrfs])
>             R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> Krnl GPRS: 007899a600000000 0000000000000006 0000000000000027 0000000000000007
>             000003ff801fdd8c 000000000002394c 00000053650a7000 ffffffffffffff8b
>             000000536b7f6000 00000053ffffff8b 00000053650a3800 0000005385935000
>             000000532054de01 0000005385935290 000003ff801fdd8c 0000004eebb1fae0
> Krnl Code: 000003ff801fdd80: c0200002a032	larl	%r2,000003ff80251de4
>             000003ff801fdd86: c0e5fffb2181	brasl	%r14,000003ff80162088
>            #000003ff801fdd8c: a7f40001		brc	15,000003ff801fdd8e
>            >000003ff801fdd90: e3a0f0a80004	lg	%r10,168(%r15)
>             000003ff801fdd96: b9040057		lgr	%r5,%r7
>             000003ff801fdd9a: a7490b5c		lghi	%r4,2908
>             000003ff801fdd9e: b904002a		lgr	%r2,%r10
>             000003ff801fdda2: c0300002604f	larl	%r3,000003ff80249e40
> Call Trace:
> ([<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs])
>   [<000003ff801c74e2>] btrfs_sync_file+0x3e2/0x550 [btrfs]
>   [<00000000003ce6ce>] do_fsync+0x5e/0x90
>   [<00000000003ce9ca>] SyS_fdatasync+0x32/0x48
>   [<00000000008ffe5c>] system_call+0xd8/0x2c8
> Last Breaking-Event-Address:
>   [<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs]
> BTRFS: error (device dm-14) in btrfs_sync_log:2908: errno=-117 unknown
> BTRFS error (device dm-14): pending csums is 269639680
> 
> This bug appears to be linked to bug 1933172, but is different and has a
> different root cause. Again, I believe that this is a regression introduced in
> the fixing of CVE-2019-19036, from 4.15.0-109-generic.
> 
> [Fix]
> 
> Analysis of the kernel oops is as follows:
> 
> The first thing we see is that BTRFS entered ERROR state with the reason:
> 
> in balance_level:1962: errno=-117 unknown
> 
> Now errno -117 is:
> 
> 100 #define EUCLEAN     117 /* Structure needs cleaning */
> 
> btrfs treats -EUCLEAN as if corruption has happened. Let's see where this is
> returned from.
> 
> If we start at fs/btrfs/ctree.c in balance_level(), line 1962:
> 
> 1917 static noinline int balance_level(struct btrfs_trans_handle *trans,
> 1918              struct btrfs_root *root,
> 1919              struct btrfs_path *path, int level)
> 1920 {
> ...
> 1958         /* promote the child to a root */
> 1959         child = read_node_slot(fs_info, mid, 0);
> 1960         if (IS_ERR(child)) {
> 1961             ret = PTR_ERR(child);
> 1962             btrfs_handle_fs_error(fs_info, ret, NULL);
> 1963             goto enospc;
> 1964         }
> ...
> 2136 }
> 
> We are in the middle of a balancing operation, and if you happen to be familiar
> with how b-tree data structures work, we are promoting a child node to a
> topmost root node.
> 
> The error most likely happens in read_node_slot(), with the lines after it
> printing that an error has happened.
> 
> 1887 static noinline struct extent_buffer *
> 1888 read_node_slot(struct btrfs_fs_info *fs_info, struct extent_buffer *parent,
> 1889            int slot)
> 1890 {
> ...
> 1900     btrfs_node_key_to_cpu(parent, &first_key, slot);
> 1901     eb = read_tree_block(fs_info, btrfs_node_blockptr(parent, slot),
> 1902                  btrfs_node_ptr_generation(parent, slot),
> 1903                  level - 1, &first_key);
> ...
> 1910 }
> 
> There are two calls here which are relevant. btrfs_node_key_to_cpu() and
> read_tree_block().
> 
> Let's look at read_tree_block() in fs/btrfs/disk-io.c:
> 
> 1147 struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr,
> 1148                       u64 parent_transid, int level,
> 1149                       struct btrfs_key *first_key)
> 1150 {
> 1151     struct extent_buffer *buf = NULL;
> 1152     int ret;
> 1153
> 1154     buf = btrfs_find_create_tree_block(fs_info, bytenr);
> 1155     if (IS_ERR(buf))
> 1156         return buf;
> 1157
> 1158     ret = btree_read_extent_buffer_pages(fs_info, buf, parent_transid,
> 1159                          level, first_key);
> 1160     if (ret) {
> 1161         free_extent_buffer_stale(buf);
> 1162         return ERR_PTR(ret);
> 1163     }
> 1164     return buf;
> 1165
> 1166 }
> 
> The interesting one here is btree_read_extent_buffer_pages():
> 
> 498 static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
> 499                       struct extent_buffer *eb,
> 500                       u64 parent_transid, int level,
> 501                       struct btrfs_key *first_key)
> 502 {
> ...
> 511     while (1) {
> 512         clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
> 513         ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
> 514                            btree_get_extent, mirror_num);
> 515         if (!ret) {
> 516             if (verify_parent_transid(io_tree, eb,
> 517                            parent_transid, 0))
> 518                 ret = -EIO;
> 519             else if (verify_level_key(fs_info, eb, level,
> 520                           first_key))
> 521                 ret = -EUCLEAN;
> 522             else
> 523                 break;
> 524         }
> 525
> 526         /*
> 527          * This buffer's crc is fine, but its contents are corrupted, so
> 528          * there is no reason to read the other copies, they won't be
> 529          * any less wrong.
> 530          */
> 531         if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
> 532             ret == -EUCLEAN)
> 533             break;
> ...
> 
> If read_extent_buffer_pages() returns zero, we call verify_level_key(), and if
> this returns non-zero, we return -EUCLEAN.
> 
> verify_level_key() can fail on three conditions.
> 
>   440 static int verify_level_key(struct btrfs_fs_info *fs_info,
>   441                 struct extent_buffer *eb, int level,
>   442                 struct btrfs_key *first_key)
>   443 {
>   448     found_level = btrfs_header_level(eb);
>   449     if (found_level != level) {
> ...
>   456         return -EIO;
>   457     }
>   458
>   459     if (!first_key)
>   460         return 0;
>   461
>   462     /* We have @first_key, so this @eb must have at least one item */
>   463     if (btrfs_header_nritems(eb) == 0) {
>   464         btrfs_err(fs_info,
>   465         "invalid tree nritems, bytenr=%llu nritems=0 expect >0",
>   466               eb->start);
>   467         WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
>   468         return -EUCLEAN;
>   469     }
>   470
>   471     if (found_level)
>   472         btrfs_node_key_to_cpu(eb, &found_key, 0);
>   473     else
>   474         btrfs_item_key_to_cpu(eb, &found_key, 0);
>   475     ret = btrfs_comp_cpu_keys(first_key, &found_key);
> ...
>   487     return ret;
>   488 }
>   
> 1) If the eb level doesn't match the provided level.
> 2) If the eb has 0 items.
> 3) If the found_key doesn't match the first_key.
> 
> With the information we currently have, we don't know what one caused the
> problem.
> 
> I looked to see when verify_level_key() was first introduced. It seems it
> arrived in 4.15.0-109-generic through the SRU of CVE-2019-19036, with the
> commit:
> 
> ubuntu-bionic 50ab1ff51db0c5eb77ffc6f15ef32f07764f86ff
> Author: Qu Wenruo <wqu at suse.com>
> Date:   Thu Mar 29 09:08:11 2018 +0800
> Subject: btrfs: Validate child tree block's level and first key
> Link: https://paste.ubuntu.com/p/DQjTkfNRDt/
> 
> If you look at this particular hunk:
> 
> https://paste.ubuntu.com/p/z4qXw2Jp9K/
> 
> We see lines 18 - 26 of the above pastebin are introduced here.
> 
> Looking at the original upstream commit:
> 
> commit 581c1760415c48cca9349b198bba52dd38750765
> Author: Qu Wenruo <wqu at suse.com>
> Date:   Thu Mar 29 09:08:11 2018 +0800
> Subject: btrfs: Validate child tree block's level and first key
> Link: https://github.com/torvalds/linux/commit/581c1760415c48cca9349b198bba52dd38750765
> 
> Particularly the same hunk:
> 
> https://paste.ubuntu.com/p/vW3Y9BvK4C/
> 
> There is a subtle difference, the second if statement is extended with the
> ret == -EUCLEAN check, and not implemented entirely.
> 
> Why is this?
> 
> I looked up when the if statement was first introduced, and it was a very old
> commit from v2.6.39-rc1:
> 
> https://paste.ubuntu.com/p/vPT36xXq66/
> 
> Particularly this hunk:
> 
> https://paste.ubuntu.com/p/2jmQRSmVfc/
> 
> Interesting. Why is a commit from 4.15-109-generic re-implement something
> that should have been there since 2.6.39?
> 
> I checked upstream, and found the if statement to be removed entirely.
> 
> That is when I came across:
> 
> commit f8397d69daef06d358430d3054662fb597e37c00
> Author: Nikolay Borisov <nborisov at suse.com>
> Date:   Tue Nov 6 16:40:20 2018 +0200
> Subject: btrfs: Always try all copies when reading extent buffers
> Link: https://github.com/torvalds/linux/commit/f8397d69daef06d358430d3054662fb597e37c00
> 
> Which talks about balance operations failing out of the blue after a raid 1
> disk was added back to the array. The commit removes the if statement, and moves
> the location of the clear EXTENT_BUFFER_CORRUPT inside the while loop.
> 
> I checked the Bionic 4.15 kernel, and I found that this commit was applied
> in 4.15.0-56-generic:
> 
> ubuntu-bionic 03e1b5c9a1c1704e109466b375d09a4721b65ec5
> Author: Nikolay Borisov <nborisov at suse.com>
> Date:   Tue Nov 6 16:40:20 2018 +0200
> Subject: btrfs: Always try all copies when reading extent buffers
> Link: https://paste.ubuntu.com/p/TS2c5Mptr2/
> 
> It appears that the if statement was removed in 4.15.0-56-generic intentionally,
> and was brought back mistakenly in a backport of "btrfs: Validate child tree
> block's level and first key" 4.15.0-109-generic.
> 
> The root cause is likely some interaction between bug 1933172 and this bug,
> which leads to EXTENT_BUFFER_CORRUPT being set, or the incorrect first_key being
> set for the root node, which means we end up returning -EUCLEAN and aborting
> the transaction.
> 
> Unfortunately, this is the second bad backport of CVE-2019-19036.
> 
> The fix is to revert "btrfs: Validate child tree block's level and first key"
> and its dependency commit "btrfs: Detect unbalanced tree with empty leaf before
> crashing btree operations", and re-apply correct backports of these commits
> with that if statement removed, to keep the spirit of the already applied
> "btrfs: Always try all copies when reading extent buffers".
> 
> We will also add the below commits as they are fixup commits for
> "btrfs: Validate child tree block's level and first key".
> 
> commit 5d41be6f702f19f72db816c17175caf9dbdcdfa6
> Author: Qu Wenruo <wqu at suse.com>
> Date:   Fri Apr 13 06:32:47 2018 +0800
> Subject: btrfs: Only check first key for committed tree blocks
> https://github.com/torvalds/linux/commit/5d41be6f702f19f72db816c17175caf9dbdcdfa6
> 
> commit 17515f1b764df36271f3166c714f5a78301fbaa7
> Author: Qu Wenruo <wqu at suse.com>
> Date:   Mon Apr 23 17:32:04 2018 +0800
> Subject: btrfs: Fix wrong first_key parameter in replace_path
> Link: https://github.com/torvalds/linux/commit/17515f1b764df36271f3166c714f5a78301fbaa7
> 
> [Testcase]
> 
> Unfortunately, the customer did not image the affected filesystem and has since
> restored a backup ontop of it.
> 
> I have been attempting to reproduce this issue for some time, but I have not
> experienced the same call trace. I ran into bug 1933172 while trying to
> reproduce this bug.
> 
> I have been trying to balance nearly full btrfs filesystems, and I have looped
> xfstests btrfs/124 for hours attempting to trigger "btrfs: Always try all copies
> when reading extent buffers" but I haven't experienced a crash yet.
> 
> I have a test kernel in the following ppa:
> 
> https://launchpad.net/~mruffell/+archive/ubuntu/sf311164-test-2
> 
> If you install it, balances still complete as expected.
> 
> I will keep attempting to reproduce the issue, and will update this section if
> I manage to create a testcase.
> 
> For regression testing, I have run xfstests btrfs/* on both 4.15.0-136-generic
> and the test kernel, and they both share the same results:
> 
> 4.15.0-136-generic from -updates:
> 
> https://paste.ubuntu.com/p/4MpZ5YVMnv/
> 
> 4.15.0-136-generic test kernel from above ppa:
> 
> https://paste.ubuntu.com/p/CsxzbCkJmJ/
> 
> [Where problems could occur]
> 
> If a regression were to occur, it would affect users of btrfs filesystems, and
> would likely show during a routine balance operation.
> 
> I believe affected users would have nearly full filesystems, and would also
> experience filesystem corruption from bug 1933172, which would then cause the
> issues from this bug when the transaction log is written to disk.
> 
> With all modifications to btrfs, there is a risk of data corruption and
> filesystem corruption for all btrfs users, since balances happen automatically
> and on a regular basis. If a regression does happen, users should remount their
> filesystems with the "nobalance" flag, backup their data, and attempt a repair
> if necessary.
> 
> [Other info]
> 
> A community member has hit this issue before, and has reported it upstream
> to linux-btrfs here, although they never received a reply.
> 
> https://www.spinics.net/lists/linux-btrfs/msg112261.html
> 
> I have written to Richard and asked if he has any additional information that
> might help reproduce, but I have yet to receive a reply.
> 
> If you read Richard's mailing list link, it mentions filesystem corruption with
> missing extents. This suggests this crash might be linked to bug 1933172, which
> I came across while trying to reproduce the issue in this bug.
> 
> Matthew Ruffell (2):
>    Revert "btrfs: Detect unbalanced tree with empty leaf before crashing
>      btree operations"
>    Revert "btrfs: Validate child tree block's level and first key"
> 
> Qu Wenruo (4):
>    btrfs: Validate child tree block's level and first key
>    btrfs: Only check first key for committed tree blocks
>    btrfs: Fix wrong first_key parameter in replace_path
>    btrfs: Detect unbalanced tree with empty leaf before crashing btree
>      operations
> 
>   fs/btrfs/disk-io.c    | 18 +++++++++---------
>   fs/btrfs/relocation.c |  2 +-
>   2 files changed, 10 insertions(+), 10 deletions(-)
> 

Applied to bionic:linux.

Thanks,
Kleber