[Bug 1906476] Re: PANIC at zfs_znode.c:335:zfs_znode_sa_init() // VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed

Fred 1906476 at bugs.launchpad.net
Wed Nov 17 10:36:47 UTC 2021


Thank you Christian,
I think I managed to repair my system.
Here is how I did it, in case it helps others.
By the way, Jonas, it is impossible to remove the broken files/folders, so the strategy I suggest is to destroy the dataset and restore it from a backup, while running from bootable media.
One can back up everything in the dataset except the corrupted files, and then try to restore those by other means: reinstalling the package they belong to, or using any backups of personal files.

I scanned every dataset with find and stat, as suggested in this thread, until stat stalled, for example with /var (I did it for /, /var, /opt and /home, which all have their own datasets):
```
sudo find /var -mount -exec echo '{}' \; -exec stat {} \;
```
At the same time I monitored kernel errors:
```
tail -f /var/log/kern.log
```
When the scan freezes on a file, its name is the last thing printed by the echo command, and a stack trace appears in the log.
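To avoid watching the whole log, one can also grep for the panic signature directly; the pattern below is taken from the messages quoted further down in this bug report (the exact grep is my suggestion, not something from the thread):
```
grep -E 'VERIFY\(0 == sa_handle_get_from_db|PANIC at zfs_znode\.c' /var/log/kern.log
```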

I was lucky, only one file got corrupted: `/var/lib/app-info/icons/ubuntu-impish-universe/48x48/plasma-workspace_preferences-desktop-color.png`.

Each time a corrupted file is found, the scan has to be restarted from the beginning with that file (or its directory) excluded, for example:
```
sudo find /var -mount -not -path '/var/lib/app-info/icons/ubuntu-impish-universe/*' -exec echo '{}' \; -exec stat {} \;
```
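If several corrupted paths turn up, each one gets its own -not -path; the second path below is only a hypothetical placeholder:
```
sudo find /var -mount \
  -not -path '/var/lib/app-info/icons/ubuntu-impish-universe/*' \
  -not -path '/var/lib/some-other-broken-dir/*' \
  -exec echo '{}' \; -exec stat {} \;
```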


Apparently, my corrupted file did not belong to any package (I checked with `apt-file search <filepath>`); in the end it turned out to be recreated automatically, I don't know how...
Otherwise, I would have reinstalled the owning package after restoring the rest.
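For reference, these are the two lookups one can try; dpkg -S only knows about installed packages, while apt-file search also covers packages that are not installed (this summary is mine, not a quote from the thread):
```
dpkg -S /var/lib/app-info/icons/ubuntu-impish-universe/48x48/plasma-workspace_preferences-desktop-color.png
apt-file search plasma-workspace_preferences-desktop-color.png
```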

I backed up the whole /var with tar:
```
sudo tar --exclude=/var/lib/app-info/icons/ubuntu-impish-universe/48x48 --acls --xattrs --numeric-owner --one-file-system -zcpvf backup_var.tar.gz /var
```
At first I did not use --numeric-owner, but the owners were all messed up, which prevented the system from reaching graphical mode (GDM complained it did not have write access to some /var/lib/gdm3/.config/ directory).
This is probably because, by default, owner/group are stored by name and get mapped to different uid/gid values on the bootable media.
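As a sanity check (my addition, not part of the original steps), the archive can be listed to confirm that owners were stored as numeric uid/gid rather than names:
```
# Owners should show up as numbers, not names, if --numeric-owner was used at creation
tar -ztvf backup_var.tar.gz | head
```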

The backup process should not stall; if it does, there might be other corrupted files that the stat scan missed (I don't know whether that can happen).

To be extra sure that my root directory (/) was not corrupted, I also created a backup of it, watching for a possible freeze, but none occurred.
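A sketch of such a backup, reusing the same tar options as for /var (my guess at the exact command, adjust excludes to your layout):
```
# --one-file-system keeps tar on the root dataset, since /var, /opt and /home are separate datasets
sudo tar --acls --xattrs --numeric-owner --one-file-system -zcpvf backup_root.tar.gz /
```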

I created a bootable USB media with Ubuntu 21.04, and booted it.
I accessed my ZFS pool:
```
sudo mkdir /mnt/install
sudo zpool import -f -R /mnt/install rpool
zfs list
```
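Before destroying anything, it may also be worth checking the pool's error list and recording the dataset's locally-set properties, in case there are no installation notes to recreate it from (these two checks are my addition):
```
zpool status -v rpool                # -v lists files with known permanent errors, if any
zfs get -s local all rpool/root/var  # locally-set properties, handy for recreating the dataset
```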
I destroyed and recreated the dataset for /var (with options from my installation notes):
```
sudo zfs destroy -r rpool/root/var
sudo zfs create -o quota=16G -o mountpoint=/var rpool/root/var
```
It is necessary to re-import the pool; otherwise a simple mount does not let you populate the new dataset (in my case, /var had been created inside the root dataset):
```
sudo zpool export -a
sudo zpool import -R /mnt/install rpool
sudo zfs mount -l -a
zfs list
```
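At this point one can confirm that /var is backed by the freshly created, still empty dataset before restoring anything (my addition):
```
zfs mount | grep /var    # the new dataset should be mounted at /mnt/install/var
df -h /mnt/install/var
```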

Now we can restore the backup (tar strips the leading '/' from member names, so the archive's var/... entries are extracted under /mnt/install/var):
```
sudo tar --acls --xattrs -zxpvf /home/user/backup_var.tar.gz -C /mnt/install
```
Check if the new dataset has the correct size and content:
```
zfs list
ll /mnt/install/var
```
Close and reboot:
```
sudo zfs umount -a
sudo reboot
```
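Alternatively (my addition, the plain unmount above is what I actually did), exporting the pool should leave it cleanly closed, so the installed system does not need a forced import on the next boot:
```
sudo zpool export rpool
sudo reboot
```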

Of course, it can get more complex if the corrupted files are more sensitive system files.
It might be necessary to chroot into the installed system in order to reinstall the packages the corrupted files come from.
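For reference, a rough sketch of that chroot path, starting from the live session with the pool imported under /mnt/install as above (the package name is a placeholder, adjust paths to your system):
```
sudo mount --rbind /dev  /mnt/install/dev
sudo mount --rbind /proc /mnt/install/proc
sudo mount --rbind /sys  /mnt/install/sys
sudo chroot /mnt/install /bin/bash
# then, inside the chroot:
apt-get update
apt-get install --reinstall <package-owning-the-corrupted-file>
```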

Hope it helps...

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to ubuntu-release-upgrader in
Ubuntu.
https://bugs.launchpad.net/bugs/1906476

Title:
  PANIC at zfs_znode.c:335:zfs_znode_sa_init() // VERIFY(0 ==
  sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED,
  &zp->z_sa_hdl)) failed

Status in Native ZFS for Linux:
  New
Status in linux package in Ubuntu:
  Invalid
Status in linux-raspi package in Ubuntu:
  Confirmed
Status in ubuntu-release-upgrader package in Ubuntu:
  Confirmed
Status in zfs-linux package in Ubuntu:
  Fix Released
Status in linux source package in Impish:
  Fix Released
Status in linux-raspi source package in Impish:
  Confirmed
Status in ubuntu-release-upgrader source package in Impish:
  Confirmed
Status in zfs-linux source package in Impish:
  Fix Released

Bug description:
  Since today, while running Ubuntu 21.04 Hirsute, I have been getting a ZFS
  panic in the kernel log which also hangs disk I/O for all
  Chrome/Electron apps.

  I have narrowed down a few important notes:
  - It does not happen with module version 0.8.4-1ubuntu11 built and included with 5.8.0-29-generic

  - It was happening when using zfs-dkms 0.8.4-1ubuntu16 built with DKMS
  on the same kernel and also on 5.8.18-acso (a custom kernel).

  - For whatever reason multiple Chrome/Electron apps were affected,
  specifically Discord, Chrome and Mattermost. In all cases (I was unable
  to strace the processes, so it was a bit hard to confirm 100%, but by
  deduction from /proc/PID/fd and the hanging ls) they seemed hung trying
  to open files in their 'Cache' directory, e.g.
  ~/.cache/google-chrome/Default/Cache and ~/.config/Mattermost/Cache;
  while the issue was going on I could not list that directory either,
  "ls" would just hang.

  - Once I removed zfs-dkms only to revert to the kernel built-in
  version it immediately worked without changing anything, removing
  files, etc.

  - It happened over multiple reboots and kernels; every time, all my
  Chrome apps stopped working, but for whatever reason nothing else
  seemed affected.

  - It would log a series of spl_panic dumps into kern.log that look like this:
  Dec  2 12:36:42 optane kernel: [   72.857033] VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed
  Dec  2 12:36:42 optane kernel: [   72.857036] PANIC at zfs_znode.c:335:zfs_znode_sa_init()

  I could only find one other google reference to this issue, with 2 other users reporting the same error but on 20.04 here:
  https://github.com/openzfs/zfs/issues/10971

  - I was not experiencing the issue on 0.8.4-1ubuntu14 and fairly sure
  it was working on 0.8.4-1ubuntu15 but broken after upgrade to
  0.8.4-1ubuntu16. I will reinstall those zfs-dkms versions to verify
  that.

  There were a few originating call stacks but the first one I hit was

  Call Trace:
   dump_stack+0x74/0x95
   spl_dumpstack+0x29/0x2b [spl]
   spl_panic+0xd4/0xfc [spl]
   ? sa_cache_constructor+0x27/0x50 [zfs]
   ? _cond_resched+0x19/0x40
   ? mutex_lock+0x12/0x40
   ? dmu_buf_set_user_ie+0x54/0x80 [zfs]
   zfs_znode_sa_init+0xe0/0xf0 [zfs]
   zfs_znode_alloc+0x101/0x700 [zfs]
   ? arc_buf_fill+0x270/0xd30 [zfs]
   ? __cv_init+0x42/0x60 [spl]
   ? dnode_cons+0x28f/0x2a0 [zfs]
   ? _cond_resched+0x19/0x40
   ? _cond_resched+0x19/0x40
   ? mutex_lock+0x12/0x40
   ? aggsum_add+0x153/0x170 [zfs]
   ? spl_kmem_alloc_impl+0xd8/0x110 [spl]
   ? arc_space_consume+0x54/0xe0 [zfs]
   ? dbuf_read+0x4a0/0xb50 [zfs]
   ? _cond_resched+0x19/0x40
   ? mutex_lock+0x12/0x40
   ? dnode_rele_and_unlock+0x5a/0xc0 [zfs]
   ? _cond_resched+0x19/0x40
   ? mutex_lock+0x12/0x40
   ? dmu_object_info_from_dnode+0x84/0xb0 [zfs]
   zfs_zget+0x1c3/0x270 [zfs]
   ? dmu_buf_rele+0x3a/0x40 [zfs]
   zfs_dirent_lock+0x349/0x680 [zfs]
   zfs_dirlook+0x90/0x2a0 [zfs]
   ? zfs_zaccess+0x10c/0x480 [zfs]
   zfs_lookup+0x202/0x3b0 [zfs]
   zpl_lookup+0xca/0x1e0 [zfs]
   path_openat+0x6a2/0xfe0
   do_filp_open+0x9b/0x110
   ? __check_object_size+0xdb/0x1b0
   ? __alloc_fd+0x46/0x170
   do_sys_openat2+0x217/0x2d0
   ? do_sys_openat2+0x217/0x2d0
   do_sys_open+0x59/0x80
   __x64_sys_openat+0x20/0x30

To manage notifications about this bug go to:
https://bugs.launchpad.net/zfs/+bug/1906476/+subscriptions



