APPLIED: [SRU][B][PATCH 0/6] btrfs: Automatic balance returns -EUCLEAN and leads to forced readonly filesystem
Kleber Souza
kleber.souza at canonical.com
Fri Jul 16 15:37:01 UTC 2021
On 07.07.21 06:40, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1934709
>
> [Impact]
>
> During an automatic balance, users may encounter an error when writing the
> transaction log to disk, when the log tree is being parsed, which forces the
> filesystem to be remounted read-only and outputs the following kernel oops:
>
> BTRFS: error (device dm-14) in balance_level:1962: errno=-117 unknown
> BTRFS info (device dm-14): forced readonly
> BTRFS: Transaction aborted (error -117)
> WARNING: CPU: 7 PID: 10818 at /build/linux-99Rib2/linux-4.15.0/fs/btrfs/tree-log.c:2908 btrfs_sync_log+0xa28/0xbc0 [btrfs]
> CPU: 7 PID: 10818 Comm: qemu-system-s39 Tainted: G OE 4.15.0-136-generic #140-Ubuntu
> Hardware name: IBM 3907 LR1 A00 (LPAR)
> Krnl PSW : 0000000076bc1d64 000000009cc65255 (btrfs_sync_log+0xa28/0xbc0 [btrfs])
> R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> Krnl GPRS: 007899a600000000 0000000000000006 0000000000000027 0000000000000007
> 000003ff801fdd8c 000000000002394c 00000053650a7000 ffffffffffffff8b
> 000000536b7f6000 00000053ffffff8b 00000053650a3800 0000005385935000
> 000000532054de01 0000005385935290 000003ff801fdd8c 0000004eebb1fae0
> Krnl Code: 000003ff801fdd80: c0200002a032 larl %r2,000003ff80251de4
> 000003ff801fdd86: c0e5fffb2181 brasl %r14,000003ff80162088
> #000003ff801fdd8c: a7f40001 brc 15,000003ff801fdd8e
> >000003ff801fdd90: e3a0f0a80004 lg %r10,168(%r15)
> 000003ff801fdd96: b9040057 lgr %r5,%r7
> 000003ff801fdd9a: a7490b5c lghi %r4,2908
> 000003ff801fdd9e: b904002a lgr %r2,%r10
> 000003ff801fdda2: c0300002604f larl %r3,000003ff80249e40
> Call Trace:
> ([<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs])
> [<000003ff801c74e2>] btrfs_sync_file+0x3e2/0x550 [btrfs]
> [<00000000003ce6ce>] do_fsync+0x5e/0x90
> [<00000000003ce9ca>] SyS_fdatasync+0x32/0x48
> [<00000000008ffe5c>] system_call+0xd8/0x2c8
> Last Breaking-Event-Address:
> [<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs]
> BTRFS: error (device dm-14) in btrfs_sync_log:2908: errno=-117 unknown
> BTRFS error (device dm-14): pending csums is 269639680
>
> This bug appears to be linked to bug 1933172, but is different and has a
> different root cause. Again, I believe that this is a regression introduced in
> the fixing of CVE-2019-19036, from 4.15.0-109-generic.
>
> [Fix]
>
> Analysis of the kernel oops is as follows:
>
> The first thing we see is that BTRFS entered ERROR state with the reason:
>
> in balance_level:1962: errno=-117 unknown
>
> Now errno -117 is:
>
> 100 #define EUCLEAN 117 /* Structure needs cleaning */
>
> btrfs treats -EUCLEAN as if corruption has happened. Let's see where this is
> returned from.
>
> If we start at fs/btrfs/ctree.c in balance_level(), line 1962:
>
> 1917 static noinline int balance_level(struct btrfs_trans_handle *trans,
> 1918 struct btrfs_root *root,
> 1919 struct btrfs_path *path, int level)
> 1920 {
> ...
> 1958 /* promote the child to a root */
> 1959 child = read_node_slot(fs_info, mid, 0);
> 1960 if (IS_ERR(child)) {
> 1961 ret = PTR_ERR(child);
> 1962 btrfs_handle_fs_error(fs_info, ret, NULL);
> 1963 goto enospc;
> 1964 }
> ...
> 2136 }
>
> We are in the middle of a balancing operation, and if you happen to be familiar
> with how b-tree data structures work, we are promoting a child node to a
> topmost root node.
>
> The error most likely happens in read_node_slot(), with the lines after it
> printing that an error has happened.
>
> 1887 static noinline struct extent_buffer *
> 1888 read_node_slot(struct btrfs_fs_info *fs_info, struct extent_buffer *parent,
> 1889 int slot)
> 1890 {
> ...
> 1900 btrfs_node_key_to_cpu(parent, &first_key, slot);
> 1901 eb = read_tree_block(fs_info, btrfs_node_blockptr(parent, slot),
> 1902 btrfs_node_ptr_generation(parent, slot),
> 1903 level - 1, &first_key);
> ...
> 1910 }
>
> There are two calls here which are relevant. btrfs_node_key_to_cpu() and
> read_tree_block().
>
> Let's look at read_tree_block() in fs/btrfs/disk-io.c:
>
> 1147 struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr,
> 1148 u64 parent_transid, int level,
> 1149 struct btrfs_key *first_key)
> 1150 {
> 1151 struct extent_buffer *buf = NULL;
> 1152 int ret;
> 1153
> 1154 buf = btrfs_find_create_tree_block(fs_info, bytenr);
> 1155 if (IS_ERR(buf))
> 1156 return buf;
> 1157
> 1158 ret = btree_read_extent_buffer_pages(fs_info, buf, parent_transid,
> 1159 level, first_key);
> 1160 if (ret) {
> 1161 free_extent_buffer_stale(buf);
> 1162 return ERR_PTR(ret);
> 1163 }
> 1164 return buf;
> 1165
> 1166 }
>
> The interesting one here is btree_read_extent_buffer_pages():
>
> 498 static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
> 499 struct extent_buffer *eb,
> 500 u64 parent_transid, int level,
> 501 struct btrfs_key *first_key)
> 502 {
> ...
> 511 while (1) {
> 512 clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
> 513 ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
> 514 btree_get_extent, mirror_num);
> 515 if (!ret) {
> 516 if (verify_parent_transid(io_tree, eb,
> 517 parent_transid, 0))
> 518 ret = -EIO;
> 519 else if (verify_level_key(fs_info, eb, level,
> 520 first_key))
> 521 ret = -EUCLEAN;
> 522 else
> 523 break;
> 524 }
> 525
> 526 /*
> 527 * This buffer's crc is fine, but its contents are corrupted, so
> 528 * there is no reason to read the other copies, they won't be
> 529 * any less wrong.
> 530 */
> 531 if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
> 532 ret == -EUCLEAN)
> 533 break;
> ...
>
> If read_extent_buffer_pages() returns zero, we call verify_level_key(), and if
> this returns non-zero, we return -EUCLEAN.
>
> verify_level_key() can fail on three conditions.
>
> 440 static int verify_level_key(struct btrfs_fs_info *fs_info,
> 441 struct extent_buffer *eb, int level,
> 442 struct btrfs_key *first_key)
> 443 {
> 448 found_level = btrfs_header_level(eb);
> 449 if (found_level != level) {
> ...
> 456 return -EIO;
> 457 }
> 458
> 459 if (!first_key)
> 460 return 0;
> 461
> 462 /* We have @first_key, so this @eb must have at least one item */
> 463 if (btrfs_header_nritems(eb) == 0) {
> 464 btrfs_err(fs_info,
> 465 "invalid tree nritems, bytenr=%llu nritems=0 expect >0",
> 466 eb->start);
> 467 WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
> 468 return -EUCLEAN;
> 469 }
> 470
> 471 if (found_level)
> 472 btrfs_node_key_to_cpu(eb, &found_key, 0);
> 473 else
> 474 btrfs_item_key_to_cpu(eb, &found_key, 0);
> 475 ret = btrfs_comp_cpu_keys(first_key, &found_key);
> ...
> 487 return ret;
> 488 }
>
> 1) If the eb level doesn't match the provided level.
> 2) If the eb has 0 items.
> 3) If the found_key doesn't match the first_key.
>
> With the information we currently have, we don't know what one caused the
> problem.
>
> I looked to see when verify_level_key() was first introduced. It seems it
> arrived in 4.15.0-109-generic through the SRU of CVE-2019-19036, with the
> commit:
>
> ubuntu-bionic 50ab1ff51db0c5eb77ffc6f15ef32f07764f86ff
> Author: Qu Wenruo <wqu at suse.com>
> Date: Thu Mar 29 09:08:11 2018 +0800
> Subject: btrfs: Validate child tree block's level and first key
> Link: https://paste.ubuntu.com/p/DQjTkfNRDt/
>
> If you look at this particular hunk:
>
> https://paste.ubuntu.com/p/z4qXw2Jp9K/
>
> We see lines 18 - 26 of the above pastebin are introduced here.
>
> Looking at the original upstream commit:
>
> commit 581c1760415c48cca9349b198bba52dd38750765
> Author: Qu Wenruo <wqu at suse.com>
> Date: Thu Mar 29 09:08:11 2018 +0800
> Subject: btrfs: Validate child tree block's level and first key
> Link: https://github.com/torvalds/linux/commit/581c1760415c48cca9349b198bba52dd38750765
>
> Particularly the same hunk:
>
> https://paste.ubuntu.com/p/vW3Y9BvK4C/
>
> There is a subtle difference, the second if statement is extended with the
> ret == -EUCLEAN check, and not implemented entirely.
>
> Why is this?
>
> I looked up when the if statement was first introduced, and it was a very old
> commit from v2.6.39-rc1:
>
> https://paste.ubuntu.com/p/vPT36xXq66/
>
> Particularly this hunk:
>
> https://paste.ubuntu.com/p/2jmQRSmVfc/
>
> Interesting. Why is a commit from 4.15-109-generic re-implement something
> that should have been there since 2.6.39?
>
> I checked upstream, and found the if statement to be removed entirely.
>
> That is when I came across:
>
> commit f8397d69daef06d358430d3054662fb597e37c00
> Author: Nikolay Borisov <nborisov at suse.com>
> Date: Tue Nov 6 16:40:20 2018 +0200
> Subject: btrfs: Always try all copies when reading extent buffers
> Link: https://github.com/torvalds/linux/commit/f8397d69daef06d358430d3054662fb597e37c00
>
> Which talks about balance operations failing out of the blue after a raid 1
> disk was added back to the array. The commit removes the if statement, and moves
> the location of the clear EXTENT_BUFFER_CORRUPT inside the while loop.
>
> I checked the Bionic 4.15 kernel, and I found that this commit was applied
> in 4.15.0-56-generic:
>
> ubuntu-bionic 03e1b5c9a1c1704e109466b375d09a4721b65ec5
> Author: Nikolay Borisov <nborisov at suse.com>
> Date: Tue Nov 6 16:40:20 2018 +0200
> Subject: btrfs: Always try all copies when reading extent buffers
> Link: https://paste.ubuntu.com/p/TS2c5Mptr2/
>
> It appears that the if statement was removed in 4.15.0-56-generic intentionally,
> and was brought back mistakenly in a backport of "btrfs: Validate child tree
> block's level and first key" 4.15.0-109-generic.
>
> The root cause is likely some interaction between bug 1933172 and this bug,
> which leads to EXTENT_BUFFER_CORRUPT being set, or the incorrect first_key being
> set for the root node, which means we end up returning -EUCLEAN and aborting
> the transaction.
>
> Unfortunately, this is the second bad backport of CVE-2019-19036.
>
> The fix is to revert "btrfs: Validate child tree block's level and first key"
> and its dependency commit "btrfs: Detect unbalanced tree with empty leaf before
> crashing btree operations", and re-apply correct backports of these commits
> with that if statement removed, to keep the spirit of the already applied
> "btrfs: Always try all copies when reading extent buffers".
>
> We will also add the below commits as they are fixup commits for
> "btrfs: Validate child tree block's level and first key".
>
> commit 5d41be6f702f19f72db816c17175caf9dbdcdfa6
> Author: Qu Wenruo <wqu at suse.com>
> Date: Fri Apr 13 06:32:47 2018 +0800
> Subject: btrfs: Only check first key for committed tree blocks
> https://github.com/torvalds/linux/commit/5d41be6f702f19f72db816c17175caf9dbdcdfa6
>
> commit 17515f1b764df36271f3166c714f5a78301fbaa7
> Author: Qu Wenruo <wqu at suse.com>
> Date: Mon Apr 23 17:32:04 2018 +0800
> Subject: btrfs: Fix wrong first_key parameter in replace_path
> Link: https://github.com/torvalds/linux/commit/17515f1b764df36271f3166c714f5a78301fbaa7
>
> [Testcase]
>
> Unfortunately, the customer did not image the affected filesystem and has since
> restored a backup ontop of it.
>
> I have been attempting to reproduce this issue for some time, but I have not
> experienced the same call trace. I ran into bug 1933172 while trying to
> reproduce this bug.
>
> I have been trying to balance nearly full btrfs filesystems, and I have looped
> xfstests btrfs/124 for hours attempting to trigger "btrfs: Always try all copies
> when reading extent buffers" but I haven't experienced a crash yet.
>
> I have a test kernel in the following ppa:
>
> https://launchpad.net/~mruffell/+archive/ubuntu/sf311164-test-2
>
> If you install it, balances still complete as expected.
>
> I will keep attempting to reproduce the issue, and will update this section if
> I manage to create a testcase.
>
> For regression testing, I have run xfstests btrfs/* on both 4.15.0-136-generic
> and the test kernel, and they both share the same results:
>
> 4.15.0-136-generic from -updates:
>
> https://paste.ubuntu.com/p/4MpZ5YVMnv/
>
> 4.15.0-136-generic test kernel from above ppa:
>
> https://paste.ubuntu.com/p/CsxzbCkJmJ/
>
> [Where problems could occur]
>
> If a regression were to occur, it would affect users of btrfs filesystems, and
> would likely show during a routine balance operation.
>
> I believe affected users would have nearly full filesystems, and would also
> experience filesystem corruption from bug 1933172, which would then cause the
> issues from this bug when the transaction log is written to disk.
>
> With all modifications to btrfs, there is a risk of data corruption and
> filesystem corruption for all btrfs users, since balances happen automatically
> and on a regular basis. If a regression does happen, users should remount their
> filesystems with the "nobalance" flag, backup their data, and attempt a repair
> if necessary.
>
> [Other info]
>
> A community member has hit this issue before, and has reported it upstream
> to linux-btrfs here, although they never received a reply.
>
> https://www.spinics.net/lists/linux-btrfs/msg112261.html
>
> I have written to Richard and asked if he has any additional information that
> might help reproduce, but I have yet to receive a reply.
>
> If you read Richard's mailing list link, it mentions filesystem corruption with
> missing extents. This suggests this crash might be linked to bug 1933172, which
> I came across while trying to reproduce the issue in this bug.
>
> Matthew Ruffell (2):
> Revert "btrfs: Detect unbalanced tree with empty leaf before crashing
> btree operations"
> Revert "btrfs: Validate child tree block's level and first key"
>
> Qu Wenruo (4):
> btrfs: Validate child tree block's level and first key
> btrfs: Only check first key for committed tree blocks
> btrfs: Fix wrong first_key parameter in replace_path
> btrfs: Detect unbalanced tree with empty leaf before crashing btree
> operations
>
> fs/btrfs/disk-io.c | 18 +++++++++---------
> fs/btrfs/relocation.c | 2 +-
> 2 files changed, 10 insertions(+), 10 deletions(-)
>
Applied to bionic:linux.
Thanks,
Kleber
More information about the kernel-team
mailing list