[Bug 2028826] [NEW] 5.19.0-50 --- mdadm RAID 5 with write journal segfaults

blogten 2028826 at bugs.launchpad.net
Thu Jul 27 07:02:56 UTC 2023


Public bug reported:

After upgrading Ubuntu 22.04 LTS from 5.15-series kernels to kernel
5.19.0-50, the new kernel started page faulting after about 2 hours of
uptime.  The fault is a kernel NULL pointer dereference in the md RAID 5
thread (md126_raid5, raid456 module), and it relates to a RAID 5 array
that has a write-through journal.  The RAID 5 array had 4 HDDs, and the
journal device was itself a 32 GB RAID 1 mdadm array consisting of
partitions on SSD devices.  The failure details from the syslog are
below.

Once this crash happens, the RAID array in question becomes
unresponsive.  The array cannot be stopped, and the reboot process will
not complete successfully.  After rebooting, mdadm reports that 0 data
pages and hundreds of thousands of parity pages have to be recovered
from the journal.  It looks like there is no data loss, but that is hard
to verify.

For reference, I had previously tried to use a write-back journal in the
same RAID 5 array.  With the earlier 5.15-series kernels, mdadm would
periodically hang and also prevent a successful reboot of the machine.
Upon restarting, mdadm would hang while trying to start the array until
I cleared out the write-back journal and added a fresh one.  This is
similar to bugs reported on the mdadm mailing list in 2020.  From that
point on, I only used write-through journals, which appeared to work OK.
With the 5.19 kernel, the write-through journal started causing the
crash described here.  The present situation is similar to bugs reported
on the mdadm mailing list in May 2023.
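
To confirm which journal mode an array is actually running in, the md
driver exposes a sysfs attribute, md/journal_mode, that lists both modes
with the active one in square brackets.  The following is a minimal
Python sketch of reading it, assuming that attribute format; the
function names and the device name in the usage note are illustrative
only and do not come from the report.

```python
import re
from pathlib import Path


def parse_journal_mode(text):
    """Return the active (bracketed) mode from a journal_mode string.

    Assumed format: '[write-through] write-back' or
    'write-through [write-back]'.
    """
    match = re.search(r"\[([a-z-]+)\]", text)
    if match is None:
        raise ValueError(f"no active mode found in {text!r}")
    return match.group(1)


def read_journal_mode(md_name):
    # Assumed sysfs location; only meaningful for an array with a journal.
    path = Path("/sys/block") / md_name / "md" / "journal_mode"
    return parse_journal_mode(path.read_text())
```

For example, read_journal_mode("md126") would be expected to return
"write-through" for the array described here, if the attribute is
present on that system.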

I dropped the old RAID 5 array with a write journal and switched to a
RAID 6 array with an internal bitmap.


Jul 26 04:02:46 <redacted> kernel: [ 7093.186750] BUG: kernel NULL pointer dereference, address: 0000000000000155
Jul 26 04:02:46 <redacted> kernel: [ 7093.186769] #PF: supervisor read access in kernel mode
Jul 26 04:02:46 <redacted> kernel: [ 7093.186774] #PF: error_code(0x0000) - not-present page
Jul 26 04:02:46 <redacted> kernel: [ 7093.186778] PGD 0 P4D 0
Jul 26 04:02:46 <redacted> kernel: [ 7093.186785] Oops: 0000 [#1] PREEMPT SMP PTI
Jul 26 04:02:46 <redacted> kernel: [ 7093.186793] CPU: 4 PID: 5645 Comm: md126_raid5 Tainted: P           OE     5.19.0-50-generic #50-Ubuntu
Jul 26 04:02:46 <redacted> kernel: [ 7093.186800] Hardware name: Supermicro X9DAi/X9DAi, BIOS 3.0 08/05/2013
Jul 26 04:02:46 <redacted> kernel: [ 7093.186804] RIP: 0010:submit_bio_noacct+0x18f/0x620
Jul 26 04:02:46 <redacted> kernel: [ 7093.186819] Code: 8b 9c eb b8 00 00 00 0f 1f 44 00 00 41 80 7c 24 14 00 79 09 f6 83 50 01 00 00 04 74 2f 41 8b 44 24 10 83 e0 01 05 54 01 00 00 <0f> b6 1c 03 80 fb 01 0f 87 2e 56 82 00 83 e3 01 74 10 4c 89 e7 e8
Jul 26 04:02:46 <redacted> kernel: [ 7093.186824] RSP: 0018:ffff9aee4fc27cb0 EFLAGS: 00010206
Jul 26 04:02:46 <redacted> kernel: [ 7093.186830] RAX: 0000000000000155 RBX: 0000000000000000 RCX: 0000000000000000
Jul 26 04:02:46 <redacted> kernel: [ 7093.186834] RDX: 0000000000040001 RSI: 0000000000000000 RDI: ffff8a76132e70b8
Jul 26 04:02:46 <redacted> kernel: [ 7093.186839] RBP: ffff9aee4fc27ce0 R08: 0000000000000000 R09: 0000000000000000
Jul 26 04:02:46 <redacted> kernel: [ 7093.186842] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a76132e70b8
Jul 26 04:02:46 <redacted> kernel: [ 7093.186846] R13: ffff8a6e57d88000 R14: ffff8a6e57bad780 R15: 0000000003fef800
Jul 26 04:02:46 <redacted> kernel: [ 7093.186851] FS:  0000000000000000(0000) GS:ffff8a759fb00000(0000) knlGS:0000000000000000
Jul 26 04:02:46 <redacted> kernel: [ 7093.186856] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 26 04:02:46 <redacted> kernel: [ 7093.186860] CR2: 0000000000000155 CR3: 0000000396a10004 CR4: 00000000001706e0
Jul 26 04:02:46 <redacted> kernel: [ 7093.186865] Call Trace:
Jul 26 04:02:46 <redacted> kernel: [ 7093.186870]  <TASK>
Jul 26 04:02:46 <redacted> kernel: [ 7093.186876]  submit_bio+0x40/0xf0
Jul 26 04:02:46 <redacted> kernel: [ 7093.186890]  r5l_flush_stripe_to_raid+0x103/0x160 [raid456]
Jul 26 04:02:46 <redacted> kernel: [ 7093.186913]  handle_active_stripes.constprop.0+0x99/0x2a0 [raid456]
Jul 26 04:02:46 <redacted> kernel: [ 7093.186928]  ? md_wakeup_thread+0x2e/0x80
Jul 26 04:02:46 <redacted> kernel: [ 7093.186937]  raid5d+0x377/0x5e0 [raid456]
Jul 26 04:02:46 <redacted> kernel: [ 7093.186953]  ? schedule_timeout+0x122/0x160
Jul 26 04:02:46 <redacted> kernel: [ 7093.186964]  md_thread+0xad/0x170
Jul 26 04:02:46 <redacted> kernel: [ 7093.186971]  ? destroy_sched_domains_rcu+0x40/0x40
Jul 26 04:02:46 <redacted> kernel: [ 7093.186982]  ? md_set_read_only+0xa0/0xa0
Jul 26 04:02:46 <redacted> kernel: [ 7093.186988]  kthread+0xee/0x120
Jul 26 04:02:46 <redacted> kernel: [ 7093.186997]  ? kthread_complete_and_exit+0x20/0x20
Jul 26 04:02:46 <redacted> kernel: [ 7093.187006]  ret_from_fork+0x22/0x30
Jul 26 04:02:46 <redacted> kernel: [ 7093.187018]  </TASK>
Jul 26 04:02:46 <redacted> kernel: [ 7093.187021] Modules linked in: tls intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp nvidia_uvm(POE) coretemp snd_hda_codec_hdmi nvidia_drm(POE) nvidia_modeset(POE) bfq binfmt_misc nls_iso8859_1 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg kvm_intel snd_intel_sdw_acpi snd_hda_codec nvidia(POE) snd_hda_core kvm snd_hwdep snd_pcm crct10dif_pclmul ghash_clmulni_intel snd_seq_midi aesni_intel snd_seq_midi_event snd_rawmidi crypto_simd cryptd rapl snd_seq drm_kms_helper intel_cstate snd_seq_device fb_sys_fops syscopyarea sysfillrect snd_timer sysimgblt serio_raw joydev input_leds mxm_wmi snd soundcore ioatdma mac_hid sch_fq_codel msr parport_pc ppdev lp drm parport ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 raid10 raid0 multipath linear hid_logitech_hidpp raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 hid_logitech_dj hid_generic
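
The register dump above is consistent with a read through a NULL base
pointer plus a small offset.  The faulting bytes marked in the Code:
line, <0f> b6 1c 03, decode to movzbl (%rbx,%rax,1),%ebx, so the address
read is RBX + RAX.  With RBX = 0 and RAX = 0x155 that gives 0x155, which
matches CR2 and the address in the BUG line.  A small Python sketch that
pulls those registers out of the oops lines and checks the arithmetic is
shown here; the parse_registers helper is hypothetical, and which
structure the NULL pointer belongs to is not determined by this check.

```python
import re

# Register lines copied from the oops above (syslog prefix removed).
OOPS_LINES = [
    "RAX: 0000000000000155 RBX: 0000000000000000 RCX: 0000000000000000",
    "CR2: 0000000000000155 CR3: 0000000396a10004 CR4: 00000000001706e0",
]


def parse_registers(lines):
    """Collect 'NAME: 16-hex-digit value' pairs into a dict of ints."""
    regs = {}
    pattern = r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b"
    for line in lines:
        for name, value in re.findall(pattern, line):
            regs[name] = int(value, 16)
    return regs


regs = parse_registers(OOPS_LINES)

# Effective address of movzbl (%rbx,%rax,1),%ebx is base + index * 1.
fault_address = regs["RBX"] + regs["RAX"]

print(hex(fault_address), fault_address == regs["CR2"])
# prints: 0x155 True
```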

** Affects: mdadm (Ubuntu)
     Importance: Undecided
         Status: New

** Summary changed:

- 5.19.0-50 --- mdadm with write journal segfaults
+ 5.19.0-50 --- mdadm RAID 5 with write journal segfaults

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to mdadm in Ubuntu.
https://bugs.launchpad.net/bugs/2028826
